Abstract

In this paper we provide a complete algebraic characterization of 
the model implied by a Bayesian network with latent variables when 
the observed variables are discrete. We show that it is algebraically 
equivalent to the so-called nested Markov model, meaning that the two 
are the same up to inequality constraints on the joint probabilities. 
The nested Markov model is therefore the best possible approximation
to the latent variable model whilst avoiding inequalities, which are
extremely complicated in general. Latent variable models also suffer from
difficulties of unidentifiable parameters and non-regular asymptotics; 
in contrast the nested Markov model is fully identifiable, represents a 
curved exponential family of known dimension, and can easily be fitted 
using an explicit parameterization. 


1 Introduction 


Directed acyclic graph (DAG) models, also known as Bayesian networks, are
widely used multivariate models in probabilistic reasoning, machine learning
and causal inference (Bishop, 2007; Darwiche, 2009; Pearl, 2009). These
models are defined by simple factorizations of the joint distribution, and in
the case of discrete or jointly Gaussian random variables, are curved
exponential families of known dimension. The inclusion of latent variables within
Bayesian network models can greatly increase their flexibility, and also
account for unobserved confounding. However, this flexibility comes at the cost
of creating models that are very complex, and that are not easy to explicitly
describe when considered as marginal models over the observed variables.
Latent variable models generally do not have fully identifiable
parameterizations, and contain singularities that lead to non-regular asymptotics
(Drton, 2009a). In addition, using them may force a modeller to specify a
parametric structure over the latent variables, introducing additional assumptions
that are generally difficult to test and may be unreasonable.

If no parametric assumptions are made about the latent variables, and no
assumption is made about their state-space, this leads to an implicitly defined
marginal model. The marginal DAG model has the advantage of avoiding
some of the assumptions made by a parametric latent variable model, but
no explicit characterization of the model is available, nor is there any
obvious method for fitting it to data.

Figure 1: A directed acyclic graph on five vertices.

Example 1.1. Consider the DAG on five vertices shown in Figure 1. The
graph represents a multivariate model over five random variables $X_0, X_1, X_2, X_3, X_4$
with the restriction that the joint distribution factorizes as

$$p(x_0, x_1, x_2, x_3, x_4) = p(x_0) \cdot p(x_1) \cdot p(x_2 \mid x_0, x_1) \cdot p(x_3 \mid x_2) \cdot p(x_4 \mid x_0, x_3);$$

here, for example, $p(x_3 \mid x_2)$ represents the conditional density of $X_3$ given
$X_2$. If we treat $X_0$ as a latent variable, the marginal model over the
remaining observed variables $(X_1, X_2, X_3, X_4)$ is the collection of probability
distributions that can be written in the form

$$p(x_1, x_2, x_3, x_4) = \int_{\mathfrak{X}_0} p(x_0) \cdot p(x_1) \cdot p(x_2 \mid x_0, x_1) \cdot p(x_3 \mid x_2) \cdot p(x_4 \mid x_0, x_3) \, dx_0. \tag{1}$$

That is, any $(X_1, X_2, X_3, X_4)$-margin of a distribution which factorizes
according to the DAG over all five variables, for any state-space or distribution
of $X_0$.¹

From either of the displayed equations above we can deduce that
$X_3 \perp\!\!\!\perp X_1 \mid X_2$, so this constraint holds in the marginal model. In other words,
the conditional distribution $p(x_3 \mid x_1, x_2)$ does not depend upon $x_1$. In addition
this model satisfies the so-called Verma constraint of Verma and Pearl
(1991), because the expression

$$\int_{\mathfrak{X}_2} p(x_2 \mid x_1) \cdot p(x_4 \mid x_1, x_2, x_3) \, dx_2 \tag{2}$$

does not depend upon $x_1$ (see Example 3.2).
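The two constraints can be checked numerically. The following is a minimal sketch, assuming that all five variables (including the latent one) are binary and that the conditional probability tables are drawn at random; the helper names `rand_dist`, `joint`, `prob` and `verma` are illustrative and not part of the paper. It builds the observed margin by summing the factorization over the latent variable and measures how much expression (2) and $p(x_3 \mid x_1, x_2)$ change when $x_1$ is switched.

```python
import itertools
import random

random.seed(0)
B = (0, 1)  # every variable is binary in this sketch (an assumption)

def rand_dist():
    """A random strictly positive distribution on {0, 1}."""
    a = random.uniform(0.1, 0.9)
    return (a, 1.0 - a)

# Conditional probability tables for the factorization in Example 1.1.
p0, p1 = rand_dist(), rand_dist()
p2 = {(x0, x1): rand_dist() for x0 in B for x1 in B}  # p(x2 | x0, x1)
p3 = {x2: rand_dist() for x2 in B}                    # p(x3 | x2)
p4 = {(x0, x3): rand_dist() for x0 in B for x3 in B}  # p(x4 | x0, x3)

def joint(x1, x2, x3, x4):
    """Observed margin: sum the factorization over the latent x0."""
    return sum(p0[x0] * p1[x1] * p2[x0, x1][x2] * p3[x2][x3] * p4[x0, x3][x4]
               for x0 in B)

def prob(x1=None, x2=None, x3=None, x4=None):
    """Probability of an event; coordinates left as None are summed out."""
    ranges = [B if v is None else (v,) for v in (x1, x2, x3, x4)]
    return sum(joint(*xs) for xs in itertools.product(*ranges))

def verma(x1, x3, x4):
    """Expression (2): sum over x2 of p(x2 | x1) * p(x4 | x1, x2, x3)."""
    return sum(
        prob(x1, x2) / prob(x1) * prob(x1, x2, x3, x4) / prob(x1, x2, x3)
        for x2 in B
    )

# Largest change in expression (2) when x1 is switched, over all (x3, x4).
verma_gap = max(abs(verma(0, x3, x4) - verma(1, x3, x4)) for x3 in B for x4 in B)

# Largest change in p(x3 | x1, x2) when x1 is switched, over all (x2, x3).
ci_gap = max(
    abs(prob(0, x2, x3) / prob(0, x2) - prob(1, x2, x3) / prob(1, x2))
    for x2 in B for x3 in B
)
```

Both gaps should vanish up to floating-point error, whatever tables are drawn. A binary latent variable is only one special case of the marginal model, which allows any state-space for $X_0$.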

If the four observed variables are binary, the set of distributions satisfying
the independence and Verma constraints is an 11-dimensional subset of the
15-dimensional probability simplex. It is not immediately clear whether
or not this set is the same as the marginal model, since in principle there
might be other restrictions. In other words, can any distribution satisfying
the constraints be written in the form (1)?² In this paper we will show
that the constraints are sufficient to describe the model, up to inequalities.
The marginal model is indeed 11-dimensional, and is algebraically defined
by the equalities discussed above.

¹ In fact, without loss of generality we can assume $X_0$ is uniform on (0, 1).

Existing approaches to this problem include the ancestral graph models
of Richardson and Spirtes (2002) and the equivalent³ models on acyclic
directed mixed graphs (ADMGs) of Richardson (2003). The Markov property
of Richardson (2003) involves only conditional independence
constraints and, in general, defines a strictly larger model than any latent
variable model: we call this the ordinary Markov model. These models are
the basis of the FCI algorithm for causal discovery in the presence of hidden
variables (Spirtes et al., 2000).

The more refined nested Markov property (Shpitser et al., 2014) for ADMGs
accounts for both conditional independences and Verma constraints.
Though the nested model is smaller than the ordinary Markov model, it is
still known to be strictly larger than the marginal model of interest, because
marginal models are subject to inequality constraints (Pearl, 1995; Evans,
2012).


1.1 Contribution 

In this paper we show that marginal models with finite discrete observed
variables are always algebraically fully described by the nested Markov
property, in the sense that the Zariski closures of the marginal model and the
nested model are the same.

A consequence of this is that a margin of a DAG model and its nested
counterpart have the same dimension, and they differ only by inequality
constraints. The situation is represented by Figure 2, which shows the marginal
model lying strictly within the nested model, but the two sharing a tangent
space at some point $p_0$. This means that we have, for the first time, a full
algebraic characterization of margins of Bayesian network models.

It also means that the nested model represents a sensible and practical
approximation to the marginal model: inequality constraints are typically
extremely complicated, so the nested model, with its factorization criterion,
separation criteria and discrete parameterization, is much easier to
work with. The parameterization means that nested models can easily be
fitted with existing algorithms (Evans and Richardson, 2010). In addition,
the nested model is regular whenever the joint distribution is positive, so in
a suitable sense it has better statistical properties than the marginal model.


² In fact, additional inequalities are present in this example, so strictly the answer to
the question is ‘no’.

³ The models are equivalent in the context we consider, but not if selection variables
are present.
Figure 2: Diagrammatic representation of the marginal model ($\mathcal{M}$) sitting
strictly within the nested model ($\mathcal{N}$), but sharing the same tangent cone
$\mathrm{TC}_0$ at the uniform distribution $p_0$.


In principle causal discovery algorithms such as the FCI algorithm, which 
currently only make use of conditional independence constraints, could be 
extended to nested models. Our main result tells us that the nested model 
gives us close to maximum power to distinguish between different causal 
structures, without making additional assumptions. 

We work with a class of hyper-graphs called mDAGs, with which we
associate the marginal models of DAGs (Evans, 2014). The remainder of the
paper is organized as follows: Section 2 introduces DAG models, their
margins and mDAGs, and carefully defines the problem of interest. Section
3 describes the nested Markov property, and Section 4 gives an outline of
the main result. Section 5 describes reductions which can be made to the
state-space of the latent variable models without loss of generality, and
Section 6 contains the main results of the paper. Finally in Section 7 we show that
a large class of marginal models represent smooth manifolds, and provide
some discussion.


2 Directed Graphical Models 

We begin with some elementary graphical definitions. 

Definition 2.1. A directed graph, $\mathcal{G}(V, \mathcal{E})$, consists of a finite set of vertices,
$V$, and a collection of edges, $\mathcal{E}$, which are ordered pairs of distinct elements
of $V$. If $(v, w) \in \mathcal{E}$ we denote this by $v \to w$, and say that $v$ is a parent of
$w$; the set of parents of $w$ is denoted $\mathrm{pa}_{\mathcal{G}}(w)$. Similarly $w$ is a child of $v$,
and the child sets are denoted $\mathrm{ch}_{\mathcal{G}}(v)$.

A directed graph is acyclic if there is no sequence of vertices
$v_1 \to v_2 \to \cdots \to v_k \to v_1$ for $k > 1$. We call such a graph a directed acyclic graph, or
DAG.

Figure 3: A directed acyclic graph on three random and two fixed vertices.


Graphs are best understood visually: an example of a DAG with five
vertices and five edges is given in Figure 1.

We will require a very slight generalization of a DAG which introduces a 
second type of node. 

Definition 2.2. A conditional DAG $\mathcal{G}(V, W, \mathcal{E})$ is a DAG with vertices
$V \cup W$ and edge set $\mathcal{E}$, with the restriction that no vertex in $W$ may have
any parents. The vertices in $V$ are the random vertices, and those in $W$ the fixed
vertices; these two sets are disjoint.

If $W = \emptyset$, this reduces to the ordinary definition of a DAG. We denote
fixed vertices with square nodes, and random ones with round nodes: see
the example in Figure 3.


2.1 Graphical Models 

A graphical model arises from the identification of a graph with a
collection of multivariate probability distributions; see Lauritzen (1996) for an
introduction. We associate each vertex $v \in V \cup W$ with a random variable
$X_v$ taking values in a finite state-space $\mathfrak{X}_v$. With a conditional DAG $\mathcal{G}$ we
associate some conditional probability measure $P$ on $\mathfrak{X}_V = \times_{v \in V} \mathfrak{X}_v$ given
$\mathfrak{X}_W = \times_{w \in W} \mathfrak{X}_w$; this distribution is subject to constraints determined by
the structure of the graph.


Definition 2.3. Let $P$ be a conditional probability distribution over $\mathfrak{X}_V$
given $\mathfrak{X}_W$ with conditional density $p$. We say that $p$ obeys the factorization
criterion with respect to a DAG $\mathcal{G}$ if it factorizes into univariate conditional
densities $p_v$, $v \in V$, as

$$p(x_V \mid x_W) = \prod_{v \in V} p_v(x_v \mid x_{\mathrm{pa}(v)}), \qquad x_{VW} \in \mathfrak{X}_{VW}. \tag{3}$$

The definition reduces to the familiar factorization criterion for DAGs if
$W = \emptyset$. The extra generality will be useful for discussing Markov properties
which involve factorization of the distribution into conditional pieces. The
fixed vertices represent variables that have been conditioned upon; $p$ satisfies
(3) if and only if, after renormalization, it also satisfies the factorization
criterion for the same DAG with all vertices random.
The definition of a Bayesian network can be extended to the case where no
joint density exists by insisting that each random variable $X_v$ can be written
as a measurable function of $X_{\mathrm{pa}(v)}$ and an independent noise variable; we call
this the structural equation property. If the density exists the two criteria are
equivalent, and since we work with discrete variables this condition is always
satisfied. Although the factorization property is often simpler to work with
for practical purposes such as modelling and fitting, the structural equation
property is useful in proofs. The well-known global Markov property based
on d-separation is also equivalent to the structural equation property (Pearl,
2009).


Example 2.4. A distribution $P$ with density $p$ obeys the factorization
criterion for the graph in Figure 1 if the density has the form

$$p(x_0, x_1, x_2, x_3, x_4) = p(x_0) \cdot p(x_1) \cdot p(x_2 \mid x_0, x_1) \cdot p(x_3 \mid x_2) \cdot p(x_4 \mid x_0, x_3).$$

Such distributions are precisely those which satisfy the conditional
independences

$$X_1 \perp\!\!\!\perp X_0, \qquad X_3 \perp\!\!\!\perp X_0, X_1 \mid X_2, \qquad X_4 \perp\!\!\!\perp X_1, X_2 \mid X_0, X_3.$$

Example 2.5. A conditional density obeys the factorization criterion for
the conditional DAG in Figure 3 if it can be written as

$$p(x_0, x_2, x_4 \mid x_1, x_3) = p(x_0) \cdot p(x_2 \mid x_0, x_1) \cdot p(x_4 \mid x_0, x_3).$$


Latent Variables 

We now introduce the possibility that some of the random vertices are
unobserved, or latent. This leads to a model defined by integrating a factorization
of the form above over the latent variables to obtain a marginal distribution.

Definition 2.6. Let $\mathcal{G}$ be a conditional DAG with random vertices $V \cup U$,
and fixed vertices $W$. A conditional density $p(x_V \mid x_W)$ is in the $V$-marginal
DAG model for $\mathcal{G}$ if there exists a density $q(x_V, x_U \mid x_W)$ such that
$q$ factorizes according to $\mathcal{G}$, and

$$p(x_V \mid x_W) = \int_{\mathfrak{X}_U} q(x_V, x_U \mid x_W) \, dx_U.$$

That is, the margin of $q$ over $V$ is $p$.

Note that in principle this definition can be altered so as not to require the
existence of a density; however, since the observed variables are all discrete,
and the latent variables will be assumed to be independent of each other,
assuming the existence of a density does not incur any loss of generality.
Figure 4: An mDAG representing the DAG in Figure 1 with the vertex 0
treated as unobserved.


2.2 mDAGs 


We will represent the collection of marginal models defined by DAGs using
a larger class of graphical models called mDAGs (‘marginal DAGs’). These
avoid dealing with latent variables directly, instead introducing additional
edges to represent them. For example, the DAG in Figure 1, with the vertex
0 treated as a latent variable, is represented by the mDAG in Figure 4.


Definition 2.7. An mDAG, $\mathcal{G}(V, W, \mathcal{E}, \mathcal{B})$, is a hyper-graph consisting of a
conditional DAG with random vertices $V$, fixed vertices $W$ and directed edge
set $\mathcal{E}$, together with a collection of bidirected hyper-edges $\mathcal{B}$: the elements
of $\mathcal{B}$ are inclusion maximal subsets of $V$, each of size at least two.


The mDAG was introduced by Evans (2014), without the additional
generality of fixed vertices. This aspect changes very little in the theory of
these graphs, but is necessary for understanding the nested Markov model.
As with conditional DAGs, when representing mDAGs graphically the fixed
vertices are drawn as square nodes and random vertices as circles.
Bidirected edges are drawn in red, as in Figure 5(a); in this case $W = \{6\}$ and
$\mathcal{B} = \{\{1,2\}, \{2,3,4\}, \{3,4,5\}\}$.

With each mDAG, $\mathcal{G}$, we can associate a conditional DAG $\bar{\mathcal{G}}$ by replacing
each bidirected edge $B \in \mathcal{B}$ with a new random vertex $u$, such that the
children of $u$ are precisely the vertices in $B$. The new vertex $u$ becomes
the ‘unobserved’ variable represented by the bidirected edge $B$. We call $\bar{\mathcal{G}}$
the canonical DAG associated with $\mathcal{G}$. The mDAG in Figure 5(a) is thus
associated with the canonical DAG in Figure 5(b).
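The construction of the canonical DAG is mechanical, and can be sketched in a few lines. The function below is a minimal illustration rather than code from the paper: the name `canonical_dag`, the representation of a graph by its parent sets, and the latent labels `"u0"`, `"u1"`, ... are all assumptions made here. The example graph is the mDAG of Figure 4 as implied by the factorization in Example 1.1, with directed edges 1 → 2 → 3 → 4 and a single bidirected edge {2, 4}.

```python
def canonical_dag(directed_edges, bidirected_edges):
    """Return the parent sets of the canonical DAG of an mDAG.

    directed_edges: iterable of (parent, child) pairs.
    bidirected_edges: iterable of vertex sets (the bidirected hyper-edges).
    Each hyper-edge is replaced by a parentless latent vertex, labelled
    "u0", "u1", ... (hypothetical labels), whose children are its members.
    """
    parents = {}
    for v, w in directed_edges:
        parents.setdefault(v, set())
        parents.setdefault(w, set()).add(v)
    for i, hyper_edge in enumerate(bidirected_edges):
        latent = f"u{i}"
        parents[latent] = set()  # latent vertices have no parents
        for child in hyper_edge:
            parents.setdefault(child, set()).add(latent)
    return parents

# mDAG of Figure 4: directed edges 1 -> 2 -> 3 -> 4, bidirected edge {2, 4}.
pa = canonical_dag([(1, 2), (2, 3), (3, 4)], [{2, 4}])
```

Up to relabelling the latent vertex as 0, the resulting parent sets are those of the DAG in Figure 1.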

Our interest in mDAGs lies in their representation of the margin of the
associated canonical DAG, and so we define our model in this spirit; see
Evans (2014).


Definition 2.8. Let $\mathcal{G}$ be an mDAG with vertices $V \cup W$. The marginal
model for $\mathcal{G}$ is the $V$-marginal model for $\bar{\mathcal{G}}$, the canonical DAG associated
with $\mathcal{G}$. Denote the collection of such distributions by $\mathcal{M}(\mathcal{G})$.


In other words, the marginal model is the collection of distributions which
could be constructed as the margin of a Bayesian network with latent
variables replacing the bidirected edges. Any latent variable model (i.e. possibly
with parametric or other distributional assumptions on the latent
variables) lies within the marginal model. If $\mathcal{G}$ is a (conditional) DAG then the
marginal model is just the usual model defined by the factorization.

Figure 5: (a) An mDAG, $\mathcal{G}$, and (b) a DAG with hidden variables, $\bar{\mathcal{G}}$,
representing the same model (the canonical DAG).

From the definitions above, the set of marginal DAG models that can be
represented by marginal models for mDAGs appears to be restricted to cases
where the latent variables have no parents. In fact this does not cause any
loss of generality (see Evans, 2014).

Just as distinct DAGs may be Markov equivalent, distinct mDAGs may
give rise to the same marginal model: for example the graphs $1 \leftarrow 2 \leftarrow 3$,
$1 \leftarrow 2 \rightarrow 3$ and $1 \leftrightarrow 2 \rightarrow 3$ all give rise to the same model. Some
partial equivalence results for mDAGs are presented in Evans (2014).


Definition 2.9. A collection of (random) vertices $C \subseteq V$ in an mDAG $\mathcal{G}$
is bidirected-connected if for any distinct $v, w \in C$, there is a sequence of
vertices $v = v_0, v_1, \ldots, v_k = w$ all in $C$ such that, for each $i = 1, \ldots, k$, the
pair $\{v_{i-1}, v_i\}$ is contained in some bidirected edge in $\mathcal{G}$.

A district of an mDAG is an inclusion maximal bidirected-connected set
of vertices.


More informally, a district is a maximal set of vertices joined by the red
edges in an mDAG. It is easy to see from the definition that districts form
a partition of the random vertices in an mDAG. The mDAG in Figure 4,
for example, contains three districts, {1}, {3} and {2,4}. Districts inspire a
useful reduction of mDAGs, via the following special subgraph.
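Since districts are the connected components of the random vertices under the bidirected hyper-edges, they can be computed with a standard union-find pass. The sketch below is illustrative only: the function name `districts` is an assumption, and for the Figure 5(a) example the random vertex set is taken to be {1, ..., 5}, which the text implies but does not state.

```python
def districts(random_vertices, bidirected_edges):
    """Partition the random vertices into districts.

    Two vertices share a district when they are linked by a chain of
    bidirected hyper-edges (bidirected-connected, as in Definition 2.9).
    """
    parent = {v: v for v in random_vertices}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    for hyper_edge in bidirected_edges:
        members = list(hyper_edge)
        for v in members[1:]:
            parent[find(v)] = find(members[0])  # merge into one component

    groups = {}
    for v in random_vertices:
        groups.setdefault(find(v), set()).add(v)
    return {frozenset(g) for g in groups.values()}

# Figure 4: a single bidirected edge {2, 4}.
fig4 = districts({1, 2, 3, 4}, [{2, 4}])

# Figure 5(a): hyper-edges {1,2}, {2,3,4}, {3,4,5}; V = {1,...,5} is assumed.
fig5 = districts({1, 2, 3, 4, 5}, [{1, 2}, {2, 3, 4}, {3, 4, 5}])
```

For Figure 4 this recovers the three districts listed above; under the stated assumption, the hyper-edges of Figure 5(a) chain all five random vertices into a single district.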

Definition 2.10. Let $\mathcal{G}$ be an mDAG containing vertices $D \subseteq V$. Then
$\mathcal{G}[D]$ is the subgraph of $\mathcal{G}$ with

(i) random vertices $D$ and fixed vertices $\mathrm{pa}_{\mathcal{G}}(D) \setminus D$;

(ii) those directed edges $w \to v$ such that $w \in D \cup \mathrm{pa}_{\mathcal{G}}(D)$ and $v \in D$;

(iii) the bidirected edges $\{B \cap D : B \in \mathcal{B}(\mathcal{G}) \text{ and } |B \cap D| \geq 2\}$.

$\mathcal{G}[D]$ is therefore the subgraph induced over $D$, together with parents of $D$
and edges directed towards $D$. Any edges between parent vertices (whether
directed or bidirected) are ignored.

For the graph in Figure 4 the subgraphs $\mathcal{G}[\{1\}]$, $\mathcal{G}[\{3\}]$ and $\mathcal{G}[\{2,4\}]$ are
shown in Figures 6(a), (b) and (c) respectively. Note in particular that the
edge $2 \to 3$ is not included in the subgraph $\mathcal{G}[\{2,4\}]$.

Proposition 2.11. Let $\mathcal{G}$ be an mDAG with districts $D_1, \ldots, D_k$. A
probability distribution $P$ with density $p$ is in the marginal model for $\mathcal{G}$ if and
only if

$$p(x_V \mid x_W) = \prod_{i=1}^{k} g_i(x_{D_i} \mid x_{\mathrm{pa}(D_i) \setminus D_i}),$$

where each $g_i$ is a probability density in the marginal model for $\mathcal{G}[D_i]$.

In addition, $p$ is in the marginal model for $\mathcal{G}$ only if for every $v \in \mathrm{sterile}_{\mathcal{G}}(V)$,
the marginal distribution

$$p(x_{V \setminus v} \mid x_W) = \sum_{x_v} p(x_V \mid x_W)$$

is in the marginal model for $\mathcal{G}_{-v}$.

Figure 6: Subgraphs corresponding to factorization of the graph in Figure
4 into districts. Parent nodes of the district are drawn as squares.

Proof. Consider the factorization of the canonical DAG $\bar{\mathcal{G}}$. The first result
follows from grouping the factors according to districts and noting that there
is no overlap in the variables being integrated out. The second result follows
from noting that if $v$ has no children, the variable $x_v$ only appears in a single
factor, and this term is a conditional distribution that integrates to 1. □

This result tells us in particular that we need only consider mDAGs
containing a single district, since the characterization of the model can always
be reduced to such graphs.


2.3 Relationship between mDAGs and ADMGs 


Previous papers considering marginal models for DAGs have used acyclic
directed mixed graphs, which are the restriction of mDAGs with random
vertices so that each bidirected edge has size two (Richardson, 2003; Shpitser et al.,
2012; Evans and Richardson, 2014).


From the perspective of the nested Markov property this restriction is not
a problem, because if we replace each bidirected hyper-edge in an mDAG
with all its subsets of size 2, we reach a conditional ADMG which under
the nested Markov property yields the same model. Thus the results of this
paper show that, algebraically, the model defined by having a single latent
parent for several variables is the same as having pairwise parents: contrast
the mDAGs in Figure 7, which both represent models of full dimension.

However if we consider the marginal model in full this is false, as the
restriction to pairwise latent parents will generally lead to additional inequality
constraints. Hence the marginal model for the mDAG in Figure 7(b) is
strictly smaller than the one for Figure 7(a) (Fritz, 2012, Proposition 2.13). See
Evans (2014) for a more detailed discussion.

Figure 7: (a) An mDAG on three vertices representing a saturated model;
(b) the bidirected 3-cycle, the simplest non-geared mDAG.


3 Nested Markov Property 


The nested Markov property is defined via constraints satisfied by the marginal
model, including conditional independences and ‘dormant independences’
such as the Verma constraint in Example 1.1 (Shpitser et al., 2014). The
property is defined in the following recursive way.


Definition 3.1 (Nested Markov Property). A conditional density $p$ obeys
the nested Markov property for an mDAG $\mathcal{G}(V, W)$ if $V = \emptyset$, or both:

1. $p$ factorizes over the districts $D_1, \ldots, D_l$ of $\mathcal{G}$:

$$p(x_V \mid x_W) = \prod_{i=1}^{l} g_i(x_{D_i} \mid x_{\mathrm{pa}(D_i) \setminus D_i})$$

such that each $g_i$ is a distribution obeying the nested Markov property with
respect to $\mathcal{G}[D_i]$; and

2. for each $v \in V$ such that $\mathrm{ch}_{\mathcal{G}}(v) = \emptyset$, the marginal distribution

$$p(x_{V \setminus v} \mid x_W) = \sum_{x_v} p(x_V \mid x_W)$$

obeys the nested Markov property with respect to $\mathcal{G}(V \setminus \{v\}, W)$, the
subgraph induced by removing $v$.
We denote the set of distributions that obey the nested Markov property
with respect to the mDAG $\mathcal{G}$ by $\mathcal{N}(\mathcal{G})$.

Example 3.2. Consider again the mDAG in Figure 4. Applying criterion
1 to this graph implies that

$$p(x_1, x_2, x_3, x_4) = g_1(x_1) \cdot g_{24}(x_2, x_4 \mid x_1, x_3) \cdot g_3(x_3 \mid x_2)$$

for some $g_1$, $g_3$ and $g_{24}$ obeying the nested Markov property with respect to
the mDAGs in Figures 6(a), (b) and (c) respectively. Applying the second
criterion to $g_{24}$ and the now childless vertex 2 (see Figure 6(c)) gives

$$\sum_{x_2} g_{24}(x_2, x_4 \mid x_1, x_3) = h(x_4 \mid x_3)$$

for some function $h$ independent of $x_1$; this is precisely the Verma constraint.

The marginal model implies additional conditions on joint distributions,
because although it is also closed under marginalization of vertices without
children (as in condition 2), this is not sufficient to describe the joint
distribution. In particular, for $p$ to be in the marginal model, $g_{24}$ must satisfy
Bell’s inequalities (see, for example, Ver Steeg and Galstyan, 2011).


The nested Markov property is sound with respect to marginal models, 
in the sense that all constraints represented by the former also hold in the 
latter. This is formalised in the following result. 

Theorem 3.3. For any mDAG $\mathcal{G}$ we have $\mathcal{M}(\mathcal{G}) \subseteq \mathcal{N}(\mathcal{G})$.


Proof. This follows from the fact that the nested Markov model is defined
in terms of constraints which are proven in Proposition 2.11 to be satisfied
by the marginal model. □


Definition 3.4. Let $\mathcal{G}$ be an mDAG with random vertices $V$. For an
arbitrary set $C \subseteq V$, define $\mathrm{sterile}_{\mathcal{G}}(C) = C \setminus \mathrm{pa}_{\mathcal{G}}(C)$. In words, $\mathrm{sterile}_{\mathcal{G}}(C)$ is
the subset of $C$ whose elements have no children in $C$. We say a set $C$ is
sterile if $C = \mathrm{sterile}_{\mathcal{G}}(C)$.

Let $\mathcal{G}$ be an mDAG. A subset of vertices $S$ is called intrinsic if it is a district
in any graph which can be obtained by iteratively applying operations of
the form 1 and 2 in Definition 3.1.

Given an intrinsic set, $S$, define $H = \mathrm{sterile}_{\mathcal{G}}(S)$ to be the recursive head,
and $T = \mathrm{pa}_{\mathcal{G}}(S)$ the tail, associated with $S$ (note that $H$ and $T$ are disjoint).
The collection of all recursive heads in $\mathcal{G}$ is denoted $\mathcal{H}(\mathcal{G})$.

Lastly, define

$$\mathcal{A}(\mathcal{G}) = \{H \cup A \mid H \in \mathcal{H}(\mathcal{G}),\ A \subseteq T\}$$

to be the parametrizable sets of $\mathcal{G}$.
Example 3.5. The mDAG in Figure 4 has districts {1}, {3} and {2,4}, so
these are all intrinsic sets. Further, in the subgraph $\mathcal{G}[\{2,4\}]$ the vertices 2
and 4 have no children, so we can marginalize either to obtain {2} and {4}
as intrinsic sets. The corresponding recursive heads and tails are then:

S    1    2    3    4    2,4
H    1    2    3    4    2,4
T    ∅    1    2    3    1,3

Note that intrinsic sets and recursive heads consist only of random vertices,
but that tails may include both random and fixed vertices.

Proposition 3.6. Let $C$ be a bidirected-connected set in an mDAG $\mathcal{G}$; then
there exists an intrinsic set $S$ such that $C \subseteq S$ and $\mathrm{sterile}_{\mathcal{G}}(S) \subseteq \mathrm{sterile}_{\mathcal{G}}(C)$.

Proof. The district containing $C$ is intrinsic by definition, so there exists an
intrinsic set containing $C$; let $S$ be a minimal intrinsic set (by inclusion)
containing $C$. By the definition of intrinsic sets $S$ is a district in some
graph reached by iteratively applying the operations 1 and 2 to $\mathcal{G}$: applying
operation 1 again gives the graph $\mathcal{G}[S]$.

Suppose for contradiction that there exists $v \in \mathrm{sterile}_{\mathcal{G}}(S) \setminus \mathrm{sterile}_{\mathcal{G}}(C)$;
then $v \notin C$, since otherwise some child of $v$ would be in $C$, and therefore in
$S$. In addition, $v$ is childless in the subgraph $\mathcal{G}[S]$, so we can remove $v$ under
operation 2 of Definition 3.1. In the resulting strictly smaller graph, $C$ is
still contained within one district, say $S'$, since $C$ is bidirected-connected;
in addition $S'$ is also intrinsic, so we have found a strictly smaller intrinsic
set $S' \supseteq C$, and reached a contradiction. □

We use the $\triangle$ operator to denote the symmetric difference of two sets:

$$A \triangle B = (A \setminus B) \cup (B \setminus A).$$

Given a finite collection of sets $A_i$, $i = 1, \ldots, k$, let

$$\bigtriangleup_{i=1}^{k} A_i = A_1 \triangle A_2 \triangle \cdots \triangle A_k$$

denote the symmetric difference of all the $A_i$. That is, it is the set containing
precisely those elements $a$ which appear in an odd number of the sets $A_i$.
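The odd-count description of the iterated symmetric difference is easy to confirm on a small case. The snippet below is an illustration only (the three sets are arbitrary and the helper name `sym_diff_all` is hypothetical); it folds Python's built-in set operator `^` over a list of sets and compares the result with a direct count of how often each element occurs.

```python
from collections import Counter
from functools import reduce

def sym_diff_all(sets):
    """Iterated symmetric difference A_1 ^ A_2 ^ ... ^ A_k."""
    return reduce(lambda a, b: a ^ b, sets, set())

sets = [{1, 2}, {2, 3}, {1, 2, 4}]
result = sym_diff_all(sets)

# Elements that occur in an odd number of the sets.
counts = Counter(x for s in sets for x in s)
odd = {x for x, c in counts.items() if c % 2 == 1}
```

Here 1 occurs twice and drops out, while 2 (three times), 3 and 4 (once each) remain, so both computations give {2, 3, 4}.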

The following result gives a characterization of the parametrizable sets in 
terms of symmetric differences which will be fundamental to our proof of 
the main results in this paper. 

Lemma 3.7. A set $A \in \mathcal{A}(\mathcal{G})$ if and only if there exists a
bidirected-connected set $C = \{v_1, \ldots, v_k\}$ in $\mathcal{G}$, and sets $A_i$, $i = 1, \ldots, k$, of the form

$$\{v_i\} \subseteq A_i \subseteq \{v_i\} \cup \mathrm{pa}_{\mathcal{G}}(v_i),$$

such that

$$A = \bigtriangleup_{i=1}^{k} A_i = A_1 \triangle \cdots \triangle A_k. \tag{4}$$

i=l 

Proof. Suppose that $A \in \mathcal{A}(\mathcal{G})$; then $H \subseteq A \subseteq H \cup T$ for some head-tail
pair $(H, T)$, with associated intrinsic set $S$. Then let $C = S$, since intrinsic
sets are by definition bidirected-connected, and consider sets of the form
(4). Such a set always contains $H$, because each $v_i \in H$ appears in $A_i$ and,
by sterility, in no other set. Every vertex $t \in T$ is (by definition) the parent
of some vertex $v_j$ in $S$, so we can choose whether $t$ appears in a set of the
form (4) simply by choosing whether or not to include $t$ in $A_j$.

Conversely, suppose that $A$ is of the form (4) for some bidirected-connected
set $C$; let $S$ be an intrinsic set satisfying the conditions of Proposition 3.6
and $(H, T)$ be its associated head-tail pair. Then the head $H = \mathrm{sterile}_{\mathcal{G}}(S) \subseteq
\mathrm{sterile}_{\mathcal{G}}(C)$. Each $v_i \in H \subseteq C$ appears in $A$, since $v_i \in A_j$ if and only if
$i = j$. Also $A \subseteq C \cup \mathrm{pa}_{\mathcal{G}}(C) \subseteq S \cup \mathrm{pa}_{\mathcal{G}}(S) = H \cup T$, so $A \in \mathcal{A}(\mathcal{G})$. □

The following corollary will allow us to generalize our later results to graphs 
which are not geared. 

Corollary 3.8. Let $\mathcal{G}$ be an mDAG, and $A \in \mathcal{A}(\mathcal{G})$. Then there exists a
geared mDAG $\mathcal{G}' \subseteq \mathcal{G}$, such that $A \in \mathcal{A}(\mathcal{G}')$.

Proof. By Lemma 3.7, $A$ is of the form (4) for some bidirected-connected
set $C$. Let $\mathcal{G}'$ have the same vertices (random and fixed) and directed edges
as $\mathcal{G}$, but be such that the set $C$ is singly connected by bidirected edges
(i.e. the edges are all of size 2 and removing any of them will cause $C$ to be
disconnected) chosen to be a subgraph of $\mathcal{G}$. Then $\mathcal{G}'$ is geared by standard
properties of trees and running intersection, and using Lemma 3.7 again we
have $A \in \mathcal{A}(\mathcal{G}')$. □


3.1 Parametrization of the nested model 


The nested Markov model can be parameterized with parameters indexed by
head-tail sets, similarly to the ordinary Markov model (Evans and Richardson,
2014); see Evans and Richardson (2015) for details. We now state some
consequences of this.


Theorem 3.9. For a state-space $\mathfrak{X}_{VW}$ the set $\mathcal{N}(\mathcal{G})$ is semi-algebraic, and
the variety defined by its Zariski closure is irreducible. Further, the model
does not have singularities within the strictly positive probability simplex,
and has dimension

$$d(\mathcal{G}, \mathfrak{X}_{VW}) = \sum_{H \in \mathcal{H}(\mathcal{G})} \prod_{h \in H} (|\mathfrak{X}_h| - 1) \cdot |\mathfrak{X}_T|.$$

Proof. That the model is semi-algebraic and has an irreducible Zariski
closure follows from the fact that the model can be defined parametrically (see,
for example, Cox et al., 2007). The smoothness and dimension are proved in
Evans and Richardson (2015). □
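As a sanity check on the dimension count, the sum over head-tail pairs can be evaluated for the running example. This is a sketch under a stated assumption: each head $H$ with tail $T$ is taken to contribute $|\mathfrak{X}_T| \prod_{h \in H} (|\mathfrak{X}_h| - 1)$ free parameters, which is the reading of the dimension formula that reproduces the 11-dimensional count quoted in Example 1.1. The function name `nested_dimension` is hypothetical.

```python
from math import prod

def nested_dimension(heads_tails, card):
    """Sum over head-tail pairs of |X_T| * prod_{h in H} (|X_h| - 1).

    heads_tails: iterable of (head, tail) pairs of vertex sets.
    card: dict mapping each vertex to the size of its state-space.
    """
    total = 0
    for head, tail in heads_tails:
        free = prod(card[h] - 1 for h in head)  # free values of the head
        configs = prod(card[t] for t in tail)   # configurations of the tail
        total += free * configs
    return total

# Head-tail pairs from Example 3.5 (the mDAG of Figure 4).
pairs = [({1}, set()), ({2}, {1}), ({3}, {2}), ({4}, {3}), ({2, 4}, {1, 3})]
binary = {v: 2 for v in (1, 2, 3, 4)}
dim = nested_dimension(pairs, binary)
```

With binary variables the five pairs contribute 1 + 2 + 2 + 2 + 4 = 11, in agreement with the 11-dimensional subset of the 15-dimensional simplex described in Example 1.1.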
4 Proof Outline


The results in Sections 5 and 6 are fairly technical, so at this point we
present a sketch of our approach with a particular example. The main
result is proved by showing that the tangent cone at the uniform distribution of
the marginal model is the same as the tangent space defined by the nested
Markov model. To achieve this, we decompose these tangent spaces
according to subsets of the vertices in the graph. The following decomposition of
the vector space will prove useful.

Definition 4.1. Let $\Lambda_A$ be the subspace of $\mathbb{R}^{|\mathfrak{X}_V|}$ consisting of vectors $p$
such that

(i) $\sum_{y_a \in \mathfrak{X}_a} p(y_a, x_{V \setminus \{a\}}) = 0$ for each $a \in A$ and $x_{V \setminus \{a\}} \in \mathfrak{X}_{V \setminus \{a\}}$;

(ii) $p(x_V) = p(y_V)$ whenever $x_A = y_A$.

In other words, considered as a function $p : \mathfrak{X}_V \to \mathbb{R}$, the value of $p$
only depends upon $x_A$, and its sum over $x_a$ for $a \in A$ (keeping the other
arguments fixed) is 0. In particular $\Lambda_\emptyset$ is the subspace spanned by the vector
of 1s. In the case where all the variables are binary, each $\Lambda_A$ corresponds
to the space spanned by the corresponding column of a log-linear design
matrix.

Proposition 4.2. The real vector space $\mathbb{R}^{|\mathfrak{X}_V|}$ can be decomposed as the
direct sum

$$\mathbb{R}^{|\mathfrak{X}_V|} = \bigoplus_{A \subseteq V} \Lambda_A.$$

In fact, the spaces $\Lambda_A$ and $\Lambda_B$ are orthogonal if $A \neq B$.

Proof. See Appendix O □ 
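In the binary case the decomposition can be made concrete. The sketch below assumes three binary variables coded 0/1 and takes each one-dimensional direction to be spanned by the vector with entries $\prod_{a \in A} (-1)^{x_a}$, the usual ±1 log-linear contrast; this coding and the helper names are assumptions made for illustration. It verifies the two defining conditions of Definition 4.1 implicitly through the Gram matrix: distinct subsets give orthogonal vectors, and there are $2^n$ of them.

```python
from itertools import combinations, product

def direction_vector(A, n):
    """Vector indexed by x in {0,1}^n with entries prod_{a in A} (-1)^{x_a}."""
    return [(-1) ** sum(x[a] for a in A) for x in product((0, 1), repeat=n)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

n = 3
subsets = [A for r in range(n + 1) for A in combinations(range(n), r)]
vectors = {A: direction_vector(A, n) for A in subsets}

# Distinct subsets give orthogonal vectors; each vector has squared norm 2^n.
gram_ok = all(
    dot(vectors[A], vectors[B]) == (2 ** n if A == B else 0)
    for A in subsets
    for B in subsets
)
```

Since the $2^n$ vectors are non-zero and mutually orthogonal they form a basis of the $2^n$-dimensional space, which is the binary case of the direct-sum decomposition in Proposition 4.2. The vector for the empty set is the vector of 1s.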

A multivariate model defined by conditional independence constraints
always contains the uniform distribution

$$p_0(x_V) = |\mathfrak{X}_V|^{-1}, \qquad x_V \in \mathfrak{X}_V,$$

at which point all variables are totally independent. Now, we will show that
the tangent cone around $p_0$ of all the models we consider is a vector space
of the form

$$\bigoplus_{A \in \mathcal{A}(\mathcal{G})} \Lambda_A, \tag{5}$$

for some collection $\mathcal{A}(\mathcal{G})$ of non-empty subsets of $V$. We will refer,
informally, to each of the spaces $\Lambda_A$ as a ‘direction’, and show that we can perturb
the distribution $p_0$ in any such direction in $\mathcal{A}(\mathcal{G})$. That is, for any vector
$q$ in $\Lambda_A$, a distribution of the form $p_0 + \eta q + O(\eta^2)$ is contained within the
model, so the tangent cone of the model around $p_0$ contains the vector space
$\Lambda_A$.

Figure 8: (a) An mDAG, and (b) its canonical DAG with functional latent
variables.


4.1 An Example 

We will illustrate the main ideas of the proof by considering the graph in
Figure 8. One can check that the nested model for this graph is defined by
the constraint $X_1 \perp\!\!\!\perp X_3, X_4, X_5$, which means that the model lies within the
space orthogonal to $\Lambda_{13} + \Lambda_{14} + \Lambda_{134} + \Lambda_{15} + \Lambda_{135} + \Lambda_{145} + \Lambda_{1345}$. We will
show that the associated marginal model can be ‘perturbed’ in every other
direction around $p_0$.

To achieve this we will fix the state-space of the latent variables to be a 
series of random functions that (once generated) determine the value of the 
observed variables. This process is formalized for a large class of graphs in 
Section [5j 

Consider the mDAG in Figure 8(a). The vertex 2 has the observed parent
1 and one latent parent, so we can assume without loss of generality that
this latent variable contains a (random) function $f_2 : \mathfrak{X}_1 \to \mathfrak{X}_2$ which tells
$X_2$ which value it should take, depending upon the value of its parent $X_1$.
One can show that it is sufficient to fix the state-space of this latent variable
to be the finite set of functions $\mathcal{F}_2 = \{f_2 : \mathfrak{X}_1 \to \mathfrak{X}_2\}$.
Now, we note that the vertex 3 has three parents: 2, the latent vertex now 
labelled / 2 , and one other latent vertex. Since we have fixed the state-space 
for all but one latent variable, we can use the same argument to say that 
without loss of generality the other contains a random function / 3 : £2 —> £ 3 
that fixes the value for A 3 given f 2 - In fact we can fix the second latent 
variable to be the collection of functions £3 = {/ 3 : £2 £ 3 }- 

A similar argument works for A 4 and leads to fixing the state-space of the 
final latent variable as £'4 x £ 5 , where 

£4 = {/ 4 : £3 x £3 £4} £5 = {f 5 ■ X 4 -> * 5 }; 

(see Figure [3(b)). We can initially assume that each function is generated 
by independently and uniformly sampling a value of its output for each 
combination of values in its input. This will lead to a completely uniform 
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distribution po over the observed variables. Let TCo be the tangent cone of 
the model M.{Q) around pq. 

Clearly any marginal distribution of $X_1$ can be obtained, so
$\Lambda_1 \subseteq \mathrm{TC}_0$. The function $f_2$ controls precisely
the conditional distribution of $X_2$ given $X_1$, so by manipulating the
distribution used to generate $f_2$ we can obtain any conditional
distribution we desire. This shows that the vector space
$\Lambda_2 + \Lambda_{12}$ is contained in the tangent cone of our model
around $p_0$.

The function $f_3$ controls the distribution of $X_3$, so by the same
reasoning we can show that $\Lambda_3 \subseteq \mathrm{TC}_0$. However, in
addition we can change the way $X_3$ responds to different values of its
parent $f_2$. Let $A_2$ be any ‘direction’ that can be perturbed by
manipulation of $f_2$ (i.e. $\{2\}$ or $\{1,2\}$). We will show (Lemma 6.9)
that because $f_2$ is an argument of the random function $f_3$, manipulation
of the distribution of $f_3$ allows us to perturb $p_0$ in any ‘direction’
$A_3$ of the form $A_3 = \{3\} \cup A_2$. In other words we can show that
$\Lambda_{23} + \Lambda_{123}$ is contained in the tangent cone. Note that the
only subset of $\{1,2,3\}$ we have not obtained so far is $\{1,3\}$, and since
$X_1 \perp\!\!\!\perp X_3$ under this model we know that this direction
cannot be perturbed.

By similar reasoning $f_4$ has $f_3$ as an argument, so we will be able to
manipulate the sets $\{3,4\}$, $\{2,3,4\}$ and $\{1,2,3,4\}$. However $f_4$
also has the argument $X_3$, meaning that it can control the conditional
distribution of $X_4$ given $X_3$, and hence the directions $\{4\}$ and
$\{3,4\}$. By combining the dependence upon these two variables we will show
that in fact we can push in any direction of the form
$A_4 = \{4\} \triangle A_3$ or $A_4 = \{3,4\} \triangle A_3$, which, all
told, will allow us to obtain any of the directions in
$\Lambda_4 + \Lambda_{24} + \Lambda_{34} + \Lambda_{124} + \Lambda_{234}
+ \Lambda_{1234}$.

Finally, the function $f_5$ controls the conditional distribution of $X_5$
given $X_4$, so by manipulating its distribution we can obtain the
$\Lambda_5 + \Lambda_{45}$ directions. However because we can alter the joint
distribution of $(f_4, f_5)$ in an arbitrary way, we can actually obtain any
direction of the form $A_4 \triangle A_5$, giving the additional sets

25 125 35 235 1235 245 1245 345 2345 12345.

All directions are now accounted for, and since they are all obtained by
local perturbations of a particular parameterization, in fact the different
directions form a vector space; the tangent cone $\mathrm{TC}_0$ is therefore
the tangent space orthogonal to
$\Lambda_{13} + \Lambda_{14} + \Lambda_{134} + \Lambda_{15} + \Lambda_{135}
+ \Lambda_{145} + \Lambda_{1345}$ and the marginal model is locally
equivalent to the model of independence
$X_1 \perp\!\!\!\perp X_3, X_4, X_5$.
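The bookkeeping in this example can be replayed mechanically. The following
sketch is only an illustration: it encodes each direction as a set of vertex
labels and assumes the combination rules exactly as described in the text
(union with {3} for vertex 3, symmetric differences for vertices 4 and 5).
The helper name `sym` is hypothetical.

```python
from itertools import combinations

def sym(*sets):
    """Symmetric difference of several vertex sets."""
    out = frozenset()
    for s in sets:
        out = out ^ frozenset(s)
    return out

A1 = {frozenset({1})}
A2 = {frozenset({2}), frozenset({1, 2})}            # directions from f_2
A3 = {frozenset({3})} | {frozenset({3}) | a for a in A2}
A4 = {sym(base, a) for base in ({4}, {3, 4}) for a in A3}
A5_own = {frozenset({5}), frozenset({4, 5})}        # directions from f_5 alone
A5 = A5_own | {sym(a4, a5) for a4 in A4 for a5 in A5_own}

reached = A1 | A2 | A3 | A4 | A5
all_nonempty = {
    frozenset(c) for r in range(1, 6) for c in combinations(range(1, 6), r)
}
missing = all_nonempty - reached
```

Under these assumptions `reached` contains 24 of the 31 non-empty subsets,
and `missing` is exactly the seven sets 13, 14, 134, 15, 135, 145 and 1345
that correspond to the independence constraint.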
5 Geared mDAGs


We define a special class of mDAGs which we term ‘geared’. For such 
graphs, the state-space of the hidden vertices can be restricted without loss 
of generality, making proofs concerning the marginal model considerably 
easier. 


Definition 5.1. Let $\mathcal{G}$ be an mDAG with bidirected hyper-edge set
$\mathcal{B}$. We say that $\mathcal{G}$ is geared if the elements of
$\mathcal{B}$ satisfy the running intersection property. That is, there is an
ordering of the edges $B_1, \ldots, B_k$ such that for each $j > 1$, there
exists $s(j) < j$ with

\[ B_j \cap \bigcup_{i<j} B_i = B_j \cap B_{s(j)}. \]

In other words, all the vertices $B_j$ shares with any previous edge are
contained within one such edge.

A particular ordering of the elements of $\mathcal{B}$ which satisfies
running intersection is called a gearing of $\mathcal{G}$.


The term ‘geared’ is chosen because a collection of bidirected edges which
satisfies running intersection may appear rather like ‘cogs’ in a set of gears:
see Figure 5. The definition is very similar to the idea of decomposability
in an undirected graph; however we avoid using this terminology, because
DAGs (which have no bidirected edges and are therefore trivially geared) may
or may not be decomposable in the original sense (Lauritzen, 1996).


Example 5.2. The simplest non-geared mDAG is the bidirected 3-cycle,
depicted in Figure 7(b); there is no way to order the bidirected edge sets
$\{1,2\}$, $\{2,3\}$, $\{1,3\}$ in a way which satisfies the running
intersection property, since whichever edge is placed last in the ordering
shares a different vertex with each of the two other edges.
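The running intersection property is easy to test by brute force on small
graphs. The sketch below is an illustration under the assumption that
bidirected edges are given as plain Python sets; the function names
`satisfies_running_intersection` and `find_gearing` are hypothetical.

```python
from itertools import permutations

def satisfies_running_intersection(edges):
    """Check the property for the edges in the given order."""
    edges = [frozenset(e) for e in edges]
    for j in range(1, len(edges)):
        earlier = edges[:j]
        shared = edges[j] & frozenset().union(*earlier)
        # The shared vertices must lie inside a single earlier edge.
        if not any(shared == edges[j] & b for b in earlier):
            return False
    return True

def find_gearing(edges):
    """Return some ordering satisfying the property, or None if none exists."""
    for order in permutations(edges):
        if satisfies_running_intersection(order):
            return list(order)
    return None

three_cycle = [{1, 2}, {2, 3}, {1, 3}]
chain = [{1, 2}, {2, 3, 4}, {3, 4, 5}]
```

Here `find_gearing(three_cycle)` returns `None`, in line with the example,
whereas `chain` already satisfies the property in the order given. Trying
every permutation is only sensible for a handful of edges.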


Given a single-district, geared mDAG with at least one bidirected edge and
a gearing $B_1, \ldots, B_k$, define

\[ R_j = B_j \setminus \bigcup_{i<j} B_i \]

(taking $R_1 = B_1$). We call $R_j$ the remainder set associated with $B_j$,
and the remainder sets partition the random vertices $V$. In addition, for a
random vertex $v \in V$, define $r(v)$ to be the unique $j$ such that
$v \in R_j$.

Now say that an ordering $<$ on the vertices in $V$ respects the gearing if
for $v \in R_i$ and $w \in R_j$, we have $v < w$ whenever $i > j$; in other
words, all the vertices in $R_k$ precede all those in $R_{k-1}$, etc; such an
ordering always exists.

For each $v \in V$ with $r(v) = j$, define

\[ \pi(v) = \bigcup_{i > j,\; v \in B_i} R_i, \]

that is, the remainders associated with all bidirected edges which contain
$v$ and are later than $j$ in the ordering. Then define a collection of
functions

\[ \mathcal{F}_v = \{ f : \mathfrak{X}_{\mathrm{pa}(v)} \times \mathcal{F}_{\pi(v)} \to \mathfrak{X}_v \}, \]

where $\mathcal{F}_A = \times_{a \in A} \mathcal{F}_a$ and
$\mathcal{F}_\emptyset = \mathfrak{X}_\emptyset = \{1\}$. This is a valid
recursive definition, since all the vertices in $\pi(v)$ precede $v$ in an
ordering which respects the gearing.

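Example 5.3. Consider the mDAG in Figure 5 and order the bidirected
edges as

\[ B_1 = \{1,2\}, \quad B_2 = \{2,3,4\}, \quad B_3 = \{3,4,5\} \]

giving remainder sets

\[ R_1 = \{1,2\}, \quad R_2 = \{3,4\}, \quad R_3 = \{5\}. \]

The ordering $5 < 4 < 3 < 2 < 1$ of the random vertices respects the gearing,
and we have

\[ \pi(1) = \pi(5) = \emptyset, \quad \pi(3) = \pi(4) = \{5\}, \quad \pi(2) = \{3,4\}. \]

In this case then

\[ \mathcal{F}_5 = \{f : \mathfrak{X}_3 \to \mathfrak{X}_5\} \]
\[ \mathcal{F}_4 = \{f : \mathfrak{X}_{2,3,6} \times \mathcal{F}_5 \to \mathfrak{X}_4\} \]
\[ \mathcal{F}_3 = \{f : \mathfrak{X}_1 \times \mathcal{F}_5 \to \mathfrak{X}_3\} \]
\[ \mathcal{F}_2 = \{f : \mathcal{F}_{3,4} \to \mathfrak{X}_2\} \]
\[ \mathcal{F}_1 = \{f : \{1\} \to \mathfrak{X}_1\} \quad \text{(or equivalently } \mathcal{F}_1 = \mathfrak{X}_1\text{)}. \]

Alternatively, if we order the bidirected edges as $\{2,3,4\}$, $\{1,2\}$,
$\{3,4,5\}$, then we could take $1 < 5 < 2 < 3 < 4$, and

\[ \pi(1) = \pi(5) = \emptyset, \quad \pi(3) = \pi(4) = \{5\}, \quad \pi(2) = \{1\}; \]

this yields $\mathcal{F}_2 = \{f : \mathcal{F}_1 \to \mathfrak{X}_2\}$, with
other collections $\mathcal{F}_v$ remaining unchanged.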
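The remainder sets and the sets written here as pi(v) follow directly from
the ordered list of bidirected edges, so they can be computed rather than
read off by hand. The sketch below is an illustration only; it assumes a
gearing is passed as a list of Python sets, and the function names
`remainder_sets` and `pi` are hypothetical.

```python
def remainder_sets(gearing):
    """R_j = B_j minus all vertices appearing in earlier edges."""
    seen, remainders = set(), []
    for edge in gearing:
        remainders.append(set(edge) - seen)
        seen |= set(edge)
    return remainders

def pi(v, gearing):
    """Union of remainders of later edges (than the one holding v) containing v."""
    remainders = remainder_sets(gearing)
    j = next(i for i, r in enumerate(remainders) if v in r)
    out = set()
    for i in range(j + 1, len(gearing)):
        if v in gearing[i]:
            out |= remainders[i]
    return out

first = [{1, 2}, {2, 3, 4}, {3, 4, 5}]
second = [{2, 3, 4}, {1, 2}, {3, 4, 5}]
```

With `first`, `pi(2, first)` is `{3, 4}`; with `second`, `pi(2, second)` is
`{1}`, while the values for vertices 1, 3, 4 and 5 are the same under both
gearings, matching the example.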
5.1 Functional Models


The property that makes geared graphs useful is that we can find a latent 
variable model with all variables discrete that yields the same set of distribu¬ 
tions over the observed variables as the marginal model. This fact provides 
a tool with which to attack the main result of this paper, and demonstrate 
the true dimension of mDAG models. 

If a vertex $v$ is contained within exactly one bidirected edge, $B$, then
without loss of generality we can assume that the latent variable
corresponding to $B$ contains all the residual information about how $X_v$
should behave given the values of its visible parents, $X_{\mathrm{pa}(v)}$.
In other words, the latent variable associated with $B$ contains a (random)
function $f_v : \mathfrak{X}_{\mathrm{pa}(v)} \to \mathfrak{X}_v$ which
‘tells’ $X_v = f_v(X_{\mathrm{pa}(v)})$ which value it should take for each
value of its other parents.

However, if $v$ is contained within two or more bidirected edges, say $B_i$
and $B_j$, it is not clear how to define such a function until the
state-space associated with one of these latent parents has already been
fixed. The decomposable structure of geared graphs makes it possible to
iteratively fix finite state-spaces for each bidirected edge without loss of
generality.

Specifically, for a single-district, geared mDAG $\mathcal{G}$ with remainder
sets $R_1, \ldots, R_k$, first form the canonical DAG $\bar{\mathcal{G}}$ by
replacing each bidirected edge $B_i$ in $\mathcal{G}$ with a new vertex $u_i$,
such that $\mathrm{ch}_{\bar{\mathcal{G}}}(u_i) = B_i$. Compare, for example,
the structure of the graphs in Figures 5(a) and (b). Then define independent
latent variables

\[ U_i = (f_v \in \mathcal{F}_v \mid v \in R_i), \qquad i = 1, \ldots, k. \]

For example, with the first gearing given in Example 5.3 for the graph in
Figure 5(a), we would have

\[ U_1 = (f_1, f_2), \quad U_2 = (f_3, f_4), \quad U_3 = (f_5). \]

Associating each variable $U_i$ with the vertex $u_i$ leads to the DAG in
Figure 9. Notice that, for each $v \in V$, the function $f_v$ is contained
within a parent variable of $v$. In addition, all the arguments of the
function $f_v$ are also parents of $v$.

For example, take $v = 4$, and note that $f_4$ is determined from
$U_2 = (f_3, f_4)$, and the associated vertex $u_2$ is a parent of 4. In
addition, $\mathcal{F}_4 = \{f : \mathfrak{X}_{2,3,6} \times \mathcal{F}_5
\to \mathfrak{X}_4\}$, so the arguments of the function $f_4$, namely $X_2$,
$X_3$, $X_6$ and $f_5$, all correspond to vertices which are also parents of 4
in Figure 9 (see Figure 10). Thus, in setting
$X_4 = f_4(X_2, X_3, X_6, f_5)$ we ensure that $X_4$ is a well defined
function of its parent variables.

In fact using this construction we can set

\[ X_v = f_v(f_{\pi(v)}, X_{\mathrm{pa}(v)}) \]

for every $v \in V$, which is well defined because the directed part of the
original mDAG is acyclic. The following result shows that the resulting
conditional distribution over $X_V$ given $X_W$ is in the marginal model for
the original mDAG.

Figure 9: A DAG with functional latent variables, associated with a gearing
of the mDAG in Figure 5(a).

Figure 10: Subgraph of the DAG in Figure 9 containing the vertex 4 and its
parents.

Theorem 5.4. Let $\mathcal{G}$ be a geared mDAG, and $R_i$, $i = 1, \ldots, k$
be the remainder sets corresponding to some gearing of $\mathcal{G}$. Suppose
we generate functions $f_v \in \mathcal{F}_v$ according to a distribution in
which

\[ (f_v \mid v \in R_i) \perp\!\!\!\perp (f_w \mid w \in V \setminus R_i), \]

for each $i = 1, \ldots, k$, and then define

\[ X_v = f_v(f_{\pi(v)}, X_{\mathrm{pa}(v)}), \qquad v \in V. \]

Then the induced conditional distribution $P$ on $X_V$ given $X_W$ is in the
marginal model for $\mathcal{G}$.

Proof. For each bidirected edge $B_i$, define the random variable
$U_i = (f_v \mid v \in R_i)$. The $U_i$s are represented by exogenous
variables on the DAG $\bar{\mathcal{G}}$, and the conditions given in the
statement of the theorem ensure they are all independent. The structural
equation property for $\bar{\mathcal{G}}$ will therefore be satisfied if each
$X_v$ is a well defined function of its parents in the graph.

In other words, the three components $f_v$, $f_{\pi(v)}$ and
$X_{\mathrm{pa}(v)}$ must all be determined from random variables which are
parents of $v$ in $\bar{\mathcal{G}}$. This holds for $X_{\mathrm{pa}(v)}$ by
definition. Additionally $v \in R_i$ implies that $v \in B_i$, and that
therefore the variable $U_i$ is a parent variable of $X_v$; then since the
function $f_v$ is just a component of $U_i$, this is indeed determined by a
parent of $v$.

Lastly suppose $w \in \pi(v)$; this happens if and only if $w, v \in B_j$ for
some $j > i$, in which case $w \in R_j$ for the minimal such $j$ by the
running intersection property of the gearing. Then $f_w$ is contained in
$U_j$, which is also a parent variable of $X_v$.

Thus $f_v$, $f_{\pi(v)}$ and $X_{\mathrm{pa}(v)}$ are all well defined
functions of parent variables of $v$, and so setting
$X_v = f_v(f_{\pi(v)}, X_{\mathrm{pa}(v)})$ respects the Markov property of
the graph. □


The idea of this formulation is that $f_v$ is a random function that ‘tells
$X_v$ what to do,’ or rather what value to take, given the values of its other
parents. If some of those other parents are also latent, then they must be
defined first, and the need to do this in a well-ordered manner explains why
it is necessary for $\mathcal{G}$ to be geared.

In fact it follows from a slight variation of Proposition 5.2 in Evans (2014)
that any distribution in the marginal model of a geared graph can be
generated in the way described in Theorem 5.4. Since each of these latent
variables takes values in a finite collection of functions, this means that
the marginal model is equivalent to the margin of a Bayesian network in which
all the random variables (latent and observed) are finite and discrete. It
follows that marginal models for geared mDAGs are semi-algebraic sets.

Figure 11: (a) An mDAG representing the instrumental variables model; (b)
a DAG with functional latent variables equivalent to the potential outcomes
model of instrumental variables.

Example 5.5. Consider the mDAG in Figure 11(a), which represents the
instrumental variables model. This is used, for example, to model
non-compliance in clinical trials, with $X_1$ representing a randomized
treatment, $X_2$ the treatment actually taken, and $X_3$ a clinical outcome.

This mDAG has only one bidirected edge and therefore is trivially geared
with $R_1 = B_1 = \{2,3\}$. This leads to functional latent variables $f_2$
and $f_3$, where

\[ \mathcal{F}_2 = \{f : \mathfrak{X}_1 \to \mathfrak{X}_2\}, \qquad
   \mathcal{F}_3 = \{f : \mathfrak{X}_2 \to \mathfrak{X}_3\}. \]

The resulting DAG model is shown in Figure 11(b). The function $f_2$ defines,
for an individual, which treatment she actually takes given which arm of the
trial she is assigned to; this is known as her compliance type. Similarly
$f_3$ determines what the patient’s outcome will be given each possible
treatment.
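For binary variables the functional instrumental variables model is small
enough to enumerate. The sketch below is an illustration under stated
assumptions: all three variables are binary, a function on a binary input is
stored as the tuple of its two outputs, and the names `FUNCTIONS`,
`conditional`, `uniform` and `compliers` are hypothetical.

```python
from fractions import Fraction
from itertools import product

STATES = (0, 1)
# A function on a binary input is stored as the tuple (f(0), f(1)).
FUNCTIONS = list(product(STATES, repeat=len(STATES)))

def conditional(p, x1):
    """P(X2 = x2, X3 = x3 | X1 = x1) when the pair (f2, f3) has distribution p,
    X2 = f2(X1) and X3 = f3(X2)."""
    dist = {(x2, x3): Fraction(0) for x2 in STATES for x3 in STATES}
    for (f2, f3), weight in p.items():
        x2 = f2[x1]
        x3 = f3[x2]
        dist[(x2, x3)] += weight
    return dist

# Independent, uniform choice of the two functions.
uniform = {(f2, f3): Fraction(1, 16) for f2 in FUNCTIONS for f3 in FUNCTIONS}

# Everyone is a "complier": f2 is the identity, f3 is still uniform.
compliers = {((0, 1), f3): Fraction(1, 4) for f3 in FUNCTIONS}
```

Drawing both functions uniformly gives the uniform conditional distribution
for either value of `x1`, while `compliers` forces the treatment taken to
equal the treatment assigned.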



We note that for our purposes the functions $f_v$ are a purely mathematical
construct, and thus philosophical questions about the nature and existence
of potential outcomes have no direct bearing on the results herein (see, for
example, Dawid, 2000).
6 Main Results


In this section we provide the technical results to prove the main theorem.
This is done first for geared mDAGs, and the result is then extended to
general graphs.


6.1 Distributions for Geared mDAGs 


Let $\mathcal{G}$ be a single-district, geared mDAG, with gearing given by
remainder sets $R_1, \ldots, R_k$; assign a probability distribution $p_i$ to
each collection of functions $U_i = (f_v \mid v \in R_i)$. Suppose we draw
variables $U_i = (f_v)_{v \in R_i}$ independently according to $p_i$, and use
them to generate observed variables $X_V$ for each possible value of the
fixed vertices $X_W$. Applying Theorem 5.4, the resulting (conditional)
distribution over the observed variables, say $P$, is in the marginal model
for $\mathcal{G}$.

Let $\pi(R_i) = \bigcup_{v \in R_i} \pi(v)$ and
$f_A = (f_v \mid v \in A)$. Define

\[ p[p_k, \ldots, p_1](x_V \mid x_W) =
   \sum_{\Phi_k(x_{VW})} p_k(f_{R_k}) \cdots
   \sum_{\Phi_1(f_{\pi(R_1)}, x_{VW})} p_1(f_{R_1}), \]

where

\[ \Phi_i(f_{\pi(R_i)}, x_{VW}) = \{ f_{R_i} \mid
   f_v(x_{\mathrm{pa}(v)}, f_{\pi(v)}) = x_v \text{ for each } v \in R_i \};
   \tag{6} \]

that is, $\Phi_i$ gives us precisely the set of functions $f_{R_i}$ that,
given appropriate values of parent variables, jointly evaluate to $x_{R_i}$.
Ultimately, then, we have a sum over combinations of functions $f_V$ that,
given the input $X_W = x_W$, jointly evaluate to $x_V$.

The function $p[\cdot]$ returns a vector indexed by $x_{VW}$ representing the
induced conditional probability distribution of $X_V$ given $X_W$. For
brevity we will generally denote this as

\[ p[p_k, \ldots, p_1] = \sum_{\Phi_k} p_k \cdots \sum_{\Phi_1} p_1, \]

with the dependence upon $x_{VW}$ left implicit. It may be helpful to think of
this as a family of probability distributions for $X_V$ given $X_W$, indexed
by parameters $p_1, \ldots, p_k$.

Example 6.1. In the case of the mDAG in Figure 5(a) we have three
bidirected edges and remainder sets, and the gearing used in Figure 9 gives

\[ p[p_3, p_2, p_1] = \sum_{\Phi_3} p_3(f_5) \sum_{\Phi_2} p_2(f_3, f_4)
   \sum_{\Phi_1} p_1(f_1, f_2), \]

where

\[ \Phi_1 = \{(f_1, f_2) : f_1 = x_1, \; f_2(f_3, f_4) = x_2\} \]
\[ \Phi_2 = \{(f_3, f_4) : f_3(x_1, f_5) = x_3, \; f_4(x_2, x_3, x_6, f_5) = x_4\} \]
\[ \Phi_3 = \{f_5 : f_5(x_3) = x_5\}. \]

By Theorem 5.4, for any $p_1, \ldots, p_k$, the induced distribution
$p[p_k, \ldots, p_1]$ on $X_V$ given $X_W$ is in the marginal model for
$\mathcal{G}$.

It is clear that choosing $p_i(f_{R_i}) = 1$ for each $i$ (up to a constant of
proportionality which, for simplicity, we ignore) induces the uniform
distribution, $p_0$, on $X_V$ for each $x_W \in \mathfrak{X}_W$. In other
words, the uniform distribution $p_0 = p[1, \ldots, 1]$ is contained within
$\mathcal{M}(\mathcal{G})$ for any mDAG $\mathcal{G}$.

6.2 Tangent cones 

Definition 6.2. Let $\mathfrak{A}$ be a subset of Euclidean space containing
a point $x$. The tangent cone of $\mathfrak{A}$ at $x$ is the set of vectors
$v$ which are of the form

\[ v = \lim_{n \to \infty} \eta_n^{-1} (v_n - x) \]

where $\eta_n \to 0$ and each $v_n \in \mathfrak{A}$.

A tangent cone is a cone, but may or may not be a vector space, depending
upon whether the set $\mathfrak{A}$ is regular at $x$. We claim here, though
will not need to prove directly, that the tangent cone of
$\mathcal{N}(\mathcal{G})$ around the uniform distribution $p_0$ is the
vector space

\[ \mathrm{TS}_0 = \bigoplus_{A \in \mathcal{A}(\mathcal{G})} \Lambda_A. \]

In fact we will show that this vector space is equal to $\mathrm{TC}_0$, the
tangent cone of $\mathcal{M}(\mathcal{G})$ at the uniform distribution, and
the characterization of $\mathrm{TS}_0$ given here will follow from the fact
that $\mathcal{M}(\mathcal{G}) \subseteq \mathcal{N}(\mathcal{G})$ and
dimension counting.

Definition 6.3. Let $\lambda : \mathfrak{X}_A \to \mathbb{R}$; we say that
$\lambda$ is $A$-degenerate (or just degenerate) if for each $a \in A$, and
$x_{A \setminus a} \in \mathfrak{X}_{A \setminus a}$,

\[ \sum_{y_a} \lambda(y_a, x_{A \setminus a}) = 0. \]

It is clear that the set of $A$-degenerate functions is isomorphic to the
vector space $\Lambda_A$, though both formulations will be useful.
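Degeneracy is a finite set of linear constraints, so it can be checked
directly. The sketch below is an illustration only: it assumes the function
is stored as a dictionary keyed by tuples of states, with states labelled
0, 1, ..., and the name `is_degenerate` is hypothetical.

```python
from itertools import product

def is_degenerate(lam, sizes):
    """True if summing lam over any single coordinate always gives zero.

    lam maps tuples of states to numbers; sizes[a] is the number of states
    of coordinate a."""
    for a in range(len(sizes)):
        others = [range(s) for i, s in enumerate(sizes) if i != a]
        for rest in product(*others):
            total = 0
            for y in range(sizes[a]):
                x = rest[:a] + (y,) + rest[a:]
                total += lam[x]
            if total != 0:
                return False
    return True

sizes = (2, 2)
# The two-way interaction contrast sums to zero over either coordinate.
interaction = {x: (-1) ** (x[0] + x[1]) for x in product(range(2), repeat=2)}
# A main effect of the first coordinate does not sum to zero over the second.
main_effect = {x: (-1) ** x[0] for x in product(range(2), repeat=2)}
```

Here `interaction` is degenerate on both coordinates, whereas `main_effect`
fails the condition for the second coordinate and so is not degenerate as a
function of the pair.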

The main result of this section follows. 

Theorem 6.4. The tangent cone of $\mathcal{M}(\mathcal{G})$ around $p_0$ is
the vector space

\[ \mathrm{TC}_0 = \bigoplus_{A \in \mathcal{A}(\mathcal{G})} \Lambda_A. \]

The proof is delayed until the end of the section. We note that the tangent
cone is a vector space, and it has the same dimension as the nested model.
6.3 Results for Geared Graphs

Definition 6.5. Given a degenerate function
$\epsilon_i : \mathcal{F}_{R_i} \to \mathbb{R}$, define

\[ D_i(\epsilon_i) = \lim_{\eta \to 0} \eta^{-1}
   \{ p[1, \ldots, 1 + \eta\epsilon_i, \ldots, 1] - p[1, \ldots, 1, \ldots, 1] \}, \]

so that $D_i(\epsilon_i)$ is a vector in $\mathbb{R}^{|\mathfrak{X}_{VW}|}$.
For sufficiently small $\eta > 0$, $1 + \eta\epsilon_i$ is non-negative and
therefore a valid distribution over $\mathcal{F}_{R_i}$; it follows that
$D_i(\epsilon_i) \in \mathrm{TC}_0(\mathcal{G})$, the tangent cone of
$\mathcal{M}(\mathcal{G})$ at $p_0$.

Let

\[ T_i = \{ D_i(\epsilon_i) \mid \epsilon_i \text{ degenerate} \}. \]

Then $T_i$ is a vector space, since the function $p[\cdot]$ is differentiable
at $(p_k, \ldots, p_1)$, and $T_1 + \cdots + T_k$ is contained within the
tangent cone of $\mathcal{M}$ around the uniform distribution.

It will be useful to define the following collection of supersets of
$\Phi_i$, for $B \subseteq V$:

\[ \Phi_i^B(f_{\pi(R_i)}, x_{VW}) = \{ f_{R_i} \mid
   f_v(x_{\mathrm{pa}(v)}, f_{\pi(v)}) = x_v \text{ for each } v \in R_i \cap B \}.
   \tag{7} \]

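Lemma 6.6. Let $C \subseteq R_i$, with
$\mathrm{sterile}_{\mathcal{G}}(C) \subseteq A \subseteq C \cup
\mathrm{pa}_{\mathcal{G}}(C)$ and $E \subseteq \pi(C)$. Then for every
degenerate function

\[ \lambda : \mathfrak{X}_A \times \mathcal{F}_E \to \mathbb{R}, \]

there exists a degenerate function $\delta : \mathcal{F}_C \to \mathbb{R}$
such that

\[ \sum_{f_{R_i} \in \Phi_i} \delta(f_C) = \lambda(x_A, f_E), \]

where $\Phi_i$ is given by (6). In addition,

\[ \sum_{f_{R_i} \in \Phi_i^B} \delta(f_C) =
   \begin{cases}
     |\mathfrak{X}_{R_i \setminus B}| \, \lambda(x_A, f_E) & \text{if } C \subseteq B \\
     0 & \text{otherwise.}
   \end{cases} \]

Proof. See appendix, Section A.3. □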

Remark 6.7. Note that if we set $E = \emptyset$, the above result shows that
for appropriate $\lambda$ and $\delta$,

\[ \eta^{-1} \{ p[1, \ldots, 1 + \eta\delta, \ldots, 1] - p[1, \ldots, 1, \ldots, 1] \}
   = \eta^{-1} \Big\{ \sum_{\Phi_k} \cdots \sum_{\Phi_i} \eta\delta
     \sum_{\Phi_{i-1}} \cdots \sum_{\Phi_1} 1 \Big\}
   \propto \lambda. \]

Hence $\Lambda_A \le T_i$ (i.e. $\Lambda_A$ is a subspace of $T_i$) for any
$A$ such that $\mathrm{sterile}_{\mathcal{G}}(C) \subseteq A \subseteq C \cup
\mathrm{pa}_{\mathcal{G}}(C)$ and $C \subseteq R_i$.

This tells us that we can obtain certain directions in our model’s tangent
space by only manipulating the distribution of a single latent variable,
$U_i$. For the full range of this space to be achieved it will be necessary
to manipulate the distribution of several adjacent$^4$ latent variables in a
co-ordinated way.

We will extend the previous result to bidirected-connected sets which span
multiple remainder sets, though we need the following lemma to ensure that
distribution over our sum works as expected.

Lemma 6.8. Let $\mathcal{G}$ be a single-district, geared mDAG, and $C$ a
bidirected-connected set of vertices. We can construct a rooted tree $\Pi_C$
with vertex set

\[ I_C = \{ i \mid R_i \cap C \neq \emptyset \}, \]

and such that $i \to j$ only if there exist $v_j \in R_j \cap C$ and
$v_i \in R_i \cap C$ such that $v_j \in \pi(v_i)$.

Proof. See appendix, Section A.4. □

The next result forms the backbone for proving Theorem 6.4; it extends
Lemma 6.6 to sets $C$ which may not be contained within a single remainder
set.

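Lemma 6.9. Let $C$ be a bidirected-connected set, and define
$C_i = C \cap R_i$ and $I = \{ i \mid C_i \neq \emptyset \}$. For
$\mathrm{sterile}_{\mathcal{G}}(C_i) \subseteq A_i \subseteq C_i \cup
\mathrm{pa}_{\mathcal{G}}(C_i)$, let

\[ A = \triangle_{i \in I} A_i. \]

Then $\Lambda_A \le T_l$, where $l$ is the minimal element of $I$.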

Proof. By Lemma 6.8 there exists a rooted tree $\Pi$ with vertices $I$, such
that $i \to j$ in $\Pi$ only if there exist $v_i \in R_i \cap C$ and
$v_j \in R_j \cap C$ with $v_j \in \pi(v_i)$. In particular $i \to j$ only if
$C_j \subseteq \pi(v_i)$.

Let $l$ be the root node of $\Pi$, and for each $j \in \mathrm{ch}_\Pi(l)$
denote by $\Pi_j$ the rooted tree with root $j$ formed only from the
descendants of $j$.

Let $\lambda_i : \mathfrak{X}_{A_i} \to \mathbb{R}$ be arbitrary
$A_i$-degenerate functions for each $i \in I$. Then starting with vertices
which have no children (i.e. the leaves of the tree), and using Lemma 6.6,
recursively define $\delta_i$ for $i \in I$ as the degenerate function of
$f_{C_i}$ such that

\[ \sum_{\Phi_i} \delta_i(f_{C_i}) = \lambda_i(x_{A_i})
   \prod_{j \in \mathrm{ch}_\Pi(i)} \delta_j(f_{C_j}), \]

where the empty product is defined to be equal to 1. Then

\[ \sum_{\Phi_k} \cdots \sum_{\Phi_{l+1}} \sum_{\Phi_l} \delta_l(f_{C_l})
   = \sum_{\Phi_k} \cdots \sum_{\Phi_{l+1}} \lambda_l(x_{A_l})
     \prod_{j \in \mathrm{ch}_\Pi(l)} \delta_j(f_{C_j}). \]

For each $i \in I$ an expression of the form
$\sum_{\Phi_i} \delta_i(f_{C_i})$ is only a function of $f_{C_a}$ for
$a \in \mathrm{ch}_\Pi(i)$ and $\Pi$ is a tree, so the sum factorizes into
components only involving the descendants of each
$j \in \mathrm{ch}_\Pi(l)$:

\[ \sum_{\Phi_k} \cdots \sum_{\Phi_1} \delta_l(f_{C_l}) \propto
   \lambda_l(x_{A_l}) \prod_{j \in \mathrm{ch}_\Pi(l)}
   \Big( \sum_{\Phi_s :\, s \in \mathrm{de}_\Pi(j)} \delta_j(f_{C_j}) \Big). \]

But then for each $j$ the factor represents a disjoint sub-tree $\Pi_j$ with
root node $j$, so we can just iterate this process within each factor, and
get

\[ \propto \prod_{i \in I} \lambda_i(x_{A_i}). \]

It follows that any function of the form

\[ \prod_{i \in I} \lambda_i(x_{A_i}) \]

lies in $T_l$; since $\Lambda_A$ is spanned by such functions it then follows
from Lemma A.3 (see Appendix A) that $\Lambda_A \le T_l$. □

4 Adjacent in the sense that they share an observable child vertex.

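Corollary 6.10. For geared graphs $\mathcal{G}$, we have

\[ \bigoplus_{A \in \mathcal{A}(\mathcal{G})} \Lambda_A \le T_1 + \cdots + T_k. \]

Proof. Reformulating Lemma 3.7 slightly, for any
$A \in \mathcal{A}(\mathcal{G})$ there exists a bidirected-connected set
$C = \bigcup_i C_i = \bigcup_i \{v_{i1}, \ldots, v_{ik_i}\}$, where
$C_i = C \cap R_i$ (we have changed nothing other than to label the vertices
$v_{ij}$ by which remainder set they are contained in). Then $A$ is of the
form

\[ A = \triangle_i A_i = \triangle_i \, \triangle_j A_{ij} \]

for some sets $A_{ij}$ such that
$\{v_{ij}\} \subseteq A_{ij} \subseteq \{v_{ij}\} \cup
\mathrm{pa}_{\mathcal{G}}(v_{ij})$.

Applying Lemma 3.7 in reverse to the bidirected-connected set $C_i$ shows
that $A_i = \triangle_j A_{ij}$ is in $\mathcal{A}(\mathcal{G})$, and
therefore satisfies $\mathrm{sterile}_{\mathcal{G}}(C_i) \subseteq A_i
\subseteq C_i \cup \mathrm{pa}_{\mathcal{G}}(C_i)$. Then by Lemma 6.9 the
space $\Lambda_A$ is contained in some $T_i$, $i = 1, \ldots, k$. □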
Figure 12: (a) an mDAG on 4 variables, and (b) a DAG with hidden variables
corresponding to a gearing of the mDAG in (a).


Example 6.11. Consider the single-district, geared mDAG in Figure 12(a);
the nested Markov model for this graph is saturated, and thus
$\mathcal{A}(\mathcal{G})$ consists of all non-empty subsets of
$\{1,2,3,4\}$.
Consider the gearing

    i    B_i      R_i
    1    {1,2}    {1,2}
    2    {1,3}    {3}
    3    {2,4}    {4}

and ordering $3 < 4 < 1 < 2$ which respects this gearing. This leads to the
hidden variable model in Figure 12(b): here

\[ f_3 : \mathfrak{X}_2 \to \mathfrak{X}_3, \qquad f_4 : \mathfrak{X}_1 \to \mathfrak{X}_4, \]
\[ f_1 : \mathcal{F}_3 \to \mathfrak{X}_1, \qquad f_2 : \mathcal{F}_4 \to \mathfrak{X}_2. \]

Applying Lemma 6.6 to each remainder set in turn tells us that

\[ \Lambda_1 + \Lambda_2 + \Lambda_{12} \le T_1 \]
\[ \Lambda_3 + \Lambda_{23} \le T_2 \]
\[ \Lambda_4 + \Lambda_{14} \le T_3. \]

We can apply Lemma 6.9 with the connected set $C = \{1,2,3,4\}$ to find that
$\Lambda_A \le T_1$, where $A$ is of the form
$A = \{1,2\} \triangle A_2 \triangle A_3$ and

\[ \{3\} \subseteq A_2 \subseteq \{2,3\}, \qquad \{4\} \subseteq A_3 \subseteq \{1,4\}; \]

that is, $A \in \{\{3,4\}, \{1,3,4\}, \{2,3,4\}, \{1,2,3,4\}\}$, and so

\[ \Lambda_{34} + \Lambda_{134} + \Lambda_{234} + \Lambda_{1234} \le T_1. \]

Repeating with $C = \{1,2,3\}$ and $\{1,2,4\}$ respectively gives
$\Lambda_{13} + \Lambda_{123} \le T_1$ and
$\Lambda_{24} + \Lambda_{124} \le T_1$.

Thus for every non-empty $A \subseteq \{1,2,3,4\}$ there is some
$i \in \{1,2,3\}$ such that $\Lambda_A \le T_i$, and therefore the tangent
cone of $\mathcal{M}(\mathcal{G})$ around the uniform distribution is the same
as that of the saturated model on four variables. In other words the nested
model and marginal model are both of full dimension.

Evans (2012) shows that the marginal model associated with this graph
induces some inequality constraints on the joint distribution, and so the
nested and marginal models are not identical.

Figure 13: (a) the bidirected 4-cycle, and (b), (c) two geared subgraphs.
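The counting in this example can be replayed by enumerating the admissible
sets. The sketch below is an illustration only: the lower and upper bounds
for each component are copied from the example, the seven sets obtained from
single remainder sets are listed by hand, and the helper names `between` and
`symmetric_differences` are hypothetical.

```python
from itertools import combinations, product

def between(lower, upper):
    """All sets A with lower <= A <= upper."""
    lower, upper = frozenset(lower), frozenset(upper)
    extra = sorted(upper - lower)
    return [lower | frozenset(c)
            for r in range(len(extra) + 1) for c in combinations(extra, r)]

def symmetric_differences(bounds):
    """Symmetric differences of one admissible set per (lower, upper) pair."""
    out = set()
    for choice in product(*(between(lo, hi) for lo, hi in bounds)):
        acc = frozenset()
        for a in choice:
            acc ^= a
        out.add(acc)
    return out

# Sets obtained from a single remainder set, as listed in the example.
single = {frozenset(s) for s in ({1}, {2}, {1, 2}, {3}, {2, 3}, {4}, {1, 4})}
part1 = ({1, 2}, {1, 2})
part2 = ({3}, {2, 3})
part3 = ({4}, {1, 4})
found = (single
         | symmetric_differences([part1, part2, part3])   # C = {1,2,3,4}
         | symmetric_differences([part1, part2])          # C = {1,2,3}
         | symmetric_differences([part1, part3]))         # C = {1,2,4}
everything = {frozenset(c)
              for r in range(1, 5) for c in combinations(range(1, 5), r)}
```

Under these inputs `found` equals `everything`, i.e. all 15 non-empty subsets
of the four vertices are reached.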


6.4 Dealing with non-geared graphs 

Corollary 6.10 puts us in a position to prove Theorem 6.4 for geared graphs;
however it does not so far extend to the general case, because we cannot
fix the state-spaces of the latent variables without a gearing. In this section
we will show that the tangent cone of a general marginal model around
the uniform distribution is just composed of the tangent cones of its geared
subgraphs, and that therefore the problem can be reduced to geared graphs.

Proposition 6.12. Let $\mathcal{G}$ be an arbitrary mDAG containing geared
subgraphs $\mathcal{G}_1, \ldots, \mathcal{G}_k$. Suppose that, for each
subgraph and a suitable gearing,
$\Lambda_{A_i} \le \mathrm{TC}_0(\mathcal{G}_i)$ as a consequence of the
earlier results in this section. Then
$\Lambda_{A_1} + \cdots + \Lambda_{A_k} \le \mathrm{TC}_0(\mathcal{G})$.

In other words, the tangent cone of $\mathcal{G}$ includes the vector space
spanned by all the tangent cones of the subgraphs.

Proof. First consider the case $W = \emptyset$ and $k = 2$, from which the
general result will follow similarly.
Let $p_1 \in \mathcal{M}(\mathcal{G}_1) \subseteq \mathcal{M}(\mathcal{G})$
be formed by random functions $f_V$ according to a gearing of
$\mathcal{G}_1$, and
$p_2 \in \mathcal{M}(\mathcal{G}_2) \subseteq \mathcal{M}(\mathcal{G})$ by
random functions $\tilde{f}_V$. Let $U_v$ be independent
Bernoulli($\tfrac{1}{2}$) variables, and define a new distribution by
setting

\[ Z_v = U_v f_v(f_{\pi(v)}, Z_{\mathrm{pa}(v)})
       + (1 - U_v) \tilde{f}_v(\tilde{f}_{\pi(v)}, Z_{\mathrm{pa}(v)}), \]

i.e. we randomly (and independently of all other vertices) choose one of
the mechanisms $f_v$ or $\tilde{f}_v$ to generate $Z_v$. Note that although
$f_v$ and $\tilde{f}_v$ are independent the values of
$f_v(f_{\pi(v)}, Z_{\mathrm{pa}(v)})$ and
$\tilde{f}_v(\tilde{f}_{\pi(v)}, Z_{\mathrm{pa}(v)})$ are not, since they
share parent variables.

Denote the resulting joint distribution of $Z_V$ by $p$. It is clear that
$p \in \mathcal{M}(\mathcal{G})$, since we are still generating each variable
as a random function of its parents and some independent noise, which clearly
satisfies the Markov property for $\mathcal{G}$.

Now place a distribution over $f_V$ which is uniform except for a
perturbation $\eta\delta(f_{C_1})$ which leads to a perturbation
$\eta\lambda_{A_1}(x_{A_1})$ over the observed joint distribution for $X_V$.
Similarly for $\tilde{f}_V$ and $Y_V$. Then we have

\[ P(Z_V = z_V) = \sum_{B \subseteq V}
   P(U_B = 1, U_{V \setminus B} = 0, X_B = z_B, Y_{V \setminus B} = z_{V \setminus B})
   = 2^{-|V|} \sum_{B \subseteq V}
   P(X_B = z_B, Y_{V \setminus B} = z_{V \setminus B}). \]

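It follows from the proof of Lemma 6.9 that if
$A \in \mathcal{A}(\mathcal{G})$ and $\lambda_A \in \Lambda_A$ then there
exists a degenerate $\delta(f_C)$ such that

\[ \sum_{\Phi_k} \cdots \sum_{\Phi_i} \delta(f_C) \cdots \sum_{\Phi_1} 1 = \lambda_A(x_A) \]

and from Lemma 6.6 that, for some constant $c > 0$,

\[ \sum_{\Phi_k^B} \cdots \sum_{\Phi_i^B} \delta(f_C) \cdots \sum_{\Phi_1^B} 1 =
   \begin{cases}
     c \, \lambda_A(x_A) & \text{if } C \subseteq B \\
     0 & \text{otherwise.}
   \end{cases} \]

Now since the functions used to generate $X_V$ and $Y_V$ are independent,

\[ P(X_B = z_B, Y_{V \setminus B} = z_{V \setminus B})
   = \Big( \sum_{\Phi_k^B} p_k \cdots \sum_{\Phi_i^B} (p_i + \eta\delta) \cdots \sum_{\Phi_1^B} p_1 \Big)
     \Big( \sum_{\tilde{\Phi}_k^{V \setminus B}} \tilde{p}_k \cdots
           \sum_{\tilde{\Phi}_j^{V \setminus B}} (\tilde{p}_j + \eta\tilde{\delta}) \cdots
           \sum_{\tilde{\Phi}_1^{V \setminus B}} \tilde{p}_1 \Big) \]
\[ = \big( |\mathfrak{X}_B|^{-1} + \eta c_1 \lambda_{A_1} + O(\eta^2) \big)
     \big( |\mathfrak{X}_{V \setminus B}|^{-1} + \eta c_2 \lambda_{A_2} + O(\eta^2) \big) \]
\[ = |\mathfrak{X}_V|^{-1} + \eta ( c'_1 \lambda_{A_1} + c'_2 \lambda_{A_2} ) + O(\eta^2). \]

It follows that

\[ P(Z_V = z_V) = |\mathfrak{X}_V|^{-1}
   + \eta ( c''_1 \lambda_{A_1} + c''_2 \lambda_{A_2} ) + O(\eta^2), \]

for some $c''_1, c''_2 > 0$. Then by an appropriate choice of scaling for each
$\lambda_{A_i}$ we see that
$\Lambda_{A_1} + \Lambda_{A_2} \le \mathrm{TC}_0(\mathcal{G})$. For non-empty
$W$, we can draw $Z_W = X_W = Y_W$ as a uniform random variable, and then
look at $X_V \mid X_W$; the proof is otherwise the same. □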

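Example 6.13. The bidirected 4-cycle in Figure 13(a) is not geared, and
therefore we cannot apply our earlier results to it directly. The nested model
for this graph, however, yields parametrizable sets

\[ \mathcal{A}(\mathcal{G}) = \{1, 2, 12, 3, 23, 123, 4, 14, 124, 34, 134, 234, 1234\} \]

(these are just the bidirected-connected sets). The two subgraphs in Figures
13(b) and (c), say $\mathcal{G}_1$ and $\mathcal{G}_2$, are geared, however,
and have parametrizable sets

\[ \mathcal{A}(\mathcal{G}_1) = \{1, 2, 12, 3, 23, 123, 4, 34, 234, 1234\} \]
\[ \mathcal{A}(\mathcal{G}_2) = \{1, 2, 12, 3, 4, 14, 124, 34, 134, 1234\}; \]

therefore $\bigoplus_{A \in \mathcal{A}(\mathcal{G}_i)} \Lambda_A =
\mathrm{TC}_0(\mathcal{G}_i)$ for $i = 1, 2$ by Corollary 6.10. Note that
$\mathcal{A}(\mathcal{G}_1) \cup \mathcal{A}(\mathcal{G}_2) =
\mathcal{A}(\mathcal{G})$, and therefore by applying Proposition 6.12 with
these graphs, we find that

\[ \bigoplus_{A \in \mathcal{A}(\mathcal{G})} \Lambda_A
   = \bigoplus_{A \in \mathcal{A}(\mathcal{G}_1)} \Lambda_A
   + \bigoplus_{A \in \mathcal{A}(\mathcal{G}_2)} \Lambda_A
   \le \mathrm{TC}_0(\mathcal{G}). \]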
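The claim that the two subgraphs jointly recover every parametrizable set of
the 4-cycle can be checked by enumerating connected vertex sets. The sketch
below is an illustration under two assumptions read off from the lists in
the example: the parametrizable sets of these three graphs coincide with
their bidirected-connected sets, and the two subgraphs are the 4-cycle with
the edge between 1 and 4, respectively between 2 and 3, removed. The name
`connected_sets` is hypothetical.

```python
from itertools import combinations

def connected_sets(vertices, edges):
    """All non-empty vertex subsets that are connected using the given edges."""
    result = set()
    for r in range(1, len(vertices) + 1):
        for subset in combinations(sorted(vertices), r):
            chosen = set(subset)
            reached, frontier = {subset[0]}, [subset[0]]
            while frontier:
                v = frontier.pop()
                for a, b in edges:
                    for s, t in ((a, b), (b, a)):
                        if s == v and t in chosen and t not in reached:
                            reached.add(t)
                            frontier.append(t)
            if reached == chosen:
                result.add(frozenset(subset))
    return result

V = {1, 2, 3, 4}
cycle = [(1, 2), (2, 3), (3, 4), (1, 4)]
g1 = [(1, 2), (2, 3), (3, 4)]   # assumed subgraph without the edge 1-4
g2 = [(1, 2), (3, 4), (1, 4)]   # assumed subgraph without the edge 2-3
```

The 4-cycle has 13 connected sets, each subgraph has 10, and the union of the
two subgraph collections equals the collection for the full cycle.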

We are now in a position to prove the main result for general mDAGs.

Proof of Theorem 6.4. Suppose first that $\mathcal{G}$ is geared.
$p[p_k, \ldots, p_i + \eta\epsilon_i, \ldots, p_1]$ obeys the nested Markov
property for any degenerate function $\epsilon_i$ and $\eta$ sufficiently
small that $1 + \eta\epsilon_i$ is positive; it follows that
$T_i \le \mathrm{TC}_0$ for each $i$, and that therefore using Corollary 6.10

\[ \bigoplus_{A \in \mathcal{A}(\mathcal{G})} \Lambda_A \le T_1 + \cdots + T_k \]

is also contained in $\mathrm{TC}_0$, by the differentiability of $p[\cdot]$
at $(p_k, \ldots, p_1)$.

Now for general $\mathcal{G}$, and each $A \in \mathcal{A}(\mathcal{G})$,
there exists a geared subgraph $\mathcal{G}'$ of $\mathcal{G}$ such that
$\Lambda_A \subseteq \mathrm{TC}_0(\mathcal{G}')$ by Corollary 3.8. Then
applying Proposition 6.12 we see that the space spanned by these subspaces is
contained within the tangent cone for $\mathcal{G}$:

\[ \bigoplus_{A \in \mathcal{A}(\mathcal{G})} \Lambda_A \le \mathrm{TC}_0(\mathcal{G}). \]

If a distribution is in the marginal model then it is also in the nested
model, and therefore $\mathrm{TC}_0$ is contained within the tangent space
$\mathrm{TS}_0$ of $\mathcal{N}(\mathcal{G})$ at $p_0$, which has dimension

\[ \dim(\mathrm{TS}_0) = \sum_{H \in \mathcal{H}(\mathcal{G})}
   \prod_{h \in H} (|\mathfrak{X}_h| - 1) \cdot |\mathfrak{X}_T|
   = \sum_{A \in \mathcal{A}(\mathcal{G})} \dim(\Lambda_A). \]

Then combining

\[ \bigoplus_{A \in \mathcal{A}(\mathcal{G})} \Lambda_A \le \mathrm{TC}_0 \subseteq \mathrm{TS}_0 \]

with the dimension of $\mathrm{TS}_0$ gives the result. □


7 Smoothness of the marginal model 


The results of Section 6, together with the smoothness of the nested model,
allow us to show that for geared graphs, the marginal model is smooth
almost everywhere.

Theorem 7.1. For a geared graph $\mathcal{G}$ and state-space
$\mathfrak{X}_{VW}$, the interior of the marginal model
$\mathcal{M}(\mathcal{G})$ is a manifold of dimension
$d(\mathcal{G}, \mathfrak{X}_{VW})$, and its boundary is described by a
finite number of semi-algebraic constraints.


Proof. The nested Markov model is parametrically defined, and therefore
its Zariski closure is an irreducible variety (see, e.g. Cox et al., 2007,
Proposition 4.5.5). Furthermore, there is a diffeomorphism between the set of
strictly positive distributions obeying the nested Markov property, and an
open parameter set. It follows that $\mathcal{N}(\mathcal{G})$ is a manifold
on the interior of the simplex.

The marginal model is a semi-algebraic set, contained within the irreducible
variety defined by the Zariski closure of the nested Markov model, so
$\mathcal{M}(\mathcal{G})$ is a subset of $\mathcal{N}(\mathcal{G})$ defined
by a finite number of additional polynomial inequalities. It follows that it
is also a manifold at any point where these inequality constraints are not
active. □


It follows from Theorem 7.1 that the interior of the marginal model for a
geared mDAG is a curved exponential family of dimension
$d(\mathcal{G}, \mathfrak{X}_{VW})$, and that therefore the nice statistical
properties of these models can be applied. For example, the maximum
likelihood estimator of a distribution within the model will be
asymptotically normal and unbiased, and the likelihood ratio statistic for
testing this model has an asymptotic
$\chi^2_{|\mathfrak{X}_V| - d - 1}$-distribution. For a point on the boundary
defined by an active inequality constraint, the asymptotic distribution may
be much more complicated (Drton, 2009b).
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Inequality constraints are generally much more complicated than equality 
constraints, and effort s to ch a racte rize them fully have been limited by com¬ 
putational challenges. Evans ( 2012 ). generalizing a result first given by Pearl 
1 1995h . provides a graphical criterion for obtaining s ome inequalities, but de¬ 
rivin g all such bounds may be an NP-hard problem (jver Steeg and CalstvaiJ . 
201 il l. 

Geared mDAG models are semi-algebraic sets because they are given by 
variable elimination over a finite discrete latent variable model. However, for 
non-geared mDAGs we cannot assume that the latent variables are discrete 
without loss of generality, so it is conceivable that these marginal models 
may be defined by non-polynomial inequalities on the probabilities. We
conjecture, however, that a result akin to Theorem 7.1 does hold for general
graphs.


7.1 Model Fitting 

In theory we can fit the marginal model for a geared graph using a latent
variable model of the kind derived in Section 5. In practice this model is
massively overparameterized and unidentifiable, with the sets
$\mathcal{F}_v$ potentially being very large even for modest graphs; this
will cause problems for most standard fitting algorithms. However, for any
graph $\mathcal{G}$ (whether geared or not) and any latent variable model
$\mathcal{C}(\mathcal{G})$, we have
$\mathcal{C}(\mathcal{G}) \subseteq \mathcal{M}(\mathcal{G}) \subseteq \mathcal{N}(\mathcal{G})$.
Fitting the nested model by maximum likelihood (ML) is straightforward using
the algorithm in Evans and Richardson (2010), and many ML methods for fitting
latent variable models are available. If the estimates for these two models
agree, then we have found the maximum likelihood estimate (MLE) for the
marginal model; if not, then we at least obtain a range of possible values
for the log-likelihood at the MLE, and can use this to confirm or refute the
marginal model.
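
The sandwich argument above can be sketched in a few lines. Because the latent variable model sits inside the marginal model, which sits inside the nested model, the maximized log-likelihoods are ordered in the same way, so the two fits bracket the marginal model's maximized log-likelihood. The function name and the tolerance are assumptions for illustration; the two input values are supposed to come from whatever fitting routines are used.

```python
def bracket_marginal_loglik(loglik_latent, loglik_nested, tol=1e-6):
    """Bracket the maximized log-likelihood of the marginal model.

    loglik_latent: log-likelihood reached by a latent variable fit
                   (a lower bound, even if only a local optimum).
    loglik_nested: maximized log-likelihood of the nested model
                   (an upper bound).
    Returns (lower, upper, agree), where agree is True when the two
    bounds coincide up to tol, so the marginal MLE has been found.
    """
    if loglik_latent > loglik_nested + tol:
        # The inclusion of models makes this impossible for correct fits.
        raise ValueError("latent fit exceeds nested fit; check the optimizers")
    lower = min(loglik_latent, loglik_nested)
    upper = loglik_nested
    agree = (upper - lower) <= tol
    return lower, upper, agree
```

When `agree` is false the interval `[lower, upper]` is the range of possible values referred to above.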

A Technical Proofs

A.1 Proof of Proposition EOl

Proof of Proposition \ f.2\. Suppose that $A, B \subseteq V$ are distinct
sets; then we claim that $\Lambda_A \perp \Lambda_B$. To see this, assume
without loss of generality that there exists $a \in A \setminus B$, and take
any $p \in \Lambda_A$ and $q \in \Lambda_B$. Then, using the fact that
$q(x_V)$ does not depend upon $x_a$,

$$
\begin{aligned}
\sum_{x_V} p(x_V) \cdot q(x_V)
  &= \sum_{x_{V \setminus a}} q(x_V) \sum_{x_a} p(x_V) \\
  &= \sum_{x_{V \setminus a}} q(x_V) \cdot 0 \\
  &= 0,
\end{aligned}
$$

so the claim holds.

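Now, we claim that the vector space $\Lambda_A$ has dimension at least
$\prod_{a \in A} (|\mathfrak{X}_a| - 1)$. To see this, give each state-space
$\mathfrak{X}_a$ an element denoted 0, and let
$\tilde{\mathfrak{X}}_a = \mathfrak{X}_a \setminus \{0\}$. We can freely pick
values $p(x_A)$ for $x_A \in \tilde{\mathfrak{X}}_A$ as long as we then ensure
that

$$
p(x_{A \setminus a}, 0) = -\sum_{y_a \in \tilde{\mathfrak{X}}_a} p(x_{A \setminus a}, y_a).
$$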

Lastly, note that counting up the dimensions of each $\Lambda_A$ gives

$$
\sum_{A \subseteq V} \prod_{a \in A} (|\mathfrak{X}_a| - 1) = \prod_{a \in V} |\mathfrak{X}_a|,
$$

which is the same dimension as $\mathbb{R}^{|\mathfrak{X}_V|}$. Since the
subspaces are all orthogonal, it follows that the direct sum gives the whole
space. □
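
The dimension count at the end of this proof is the identity obtained by expanding $\prod_{a}(1 + (k_a - 1))$ with $k_a$ the number of states of variable $a$. The short check below, whose function name is chosen only for this example, sums the subspace dimensions over all subsets of the variables so that the total can be compared with the size of the joint state-space.

```python
from itertools import combinations
from math import prod


def sum_of_subspace_dims(state_sizes):
    """Sum over all subsets A of prod_{a in A} (k_a - 1).

    state_sizes[a] is k_a, the number of states of variable a.
    The empty subset contributes 1 (the constant functions).
    """
    n = len(state_sizes)
    total = 0
    for r in range(n + 1):
        for subset in combinations(range(n), r):
            total += prod(state_sizes[a] - 1 for a in subset)
    return total
```

For state-space sizes 2, 3 and 4 the subsets contribute 1 + (1 + 2 + 3) + (2 + 3 + 6) + 6 = 24, which equals 2 · 3 · 4.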


A.2 Degenerate Functions 

We present a series of Lemmas which build up to showing that we can 
construct degenerate functions from finite sums and products of degenerate 
functions with simpler argument sets. 

Lemma A.1. Let $\lambda$ be a discrete $(A \cup B)$-degenerate function, for
$A \cap B = \emptyset$. Then $\lambda$ can be written as a finite sum

$$
\lambda = \sum_i \lambda_A^i \lambda_B^i
$$

of $A$-degenerate functions $\lambda_A^i$ and $B$-degenerate functions
$\lambda_B^i$.

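Proof. Since a matrix can be written as a sum of rank one matrices, clearly
we can find (not necessarily degenerate) functions such that the result holds.
But now suppose that the $\lambda_A^i$ are not degenerate over some
$a \in A$, and consider

$$
\begin{aligned}
&\sum_i \Big( \lambda_A^i(x_A) - |\mathfrak{X}_a|^{-1} \sum_{y_a} \lambda_A^i(x_{A \setminus a}, y_a) \Big) \lambda_B^i(x_B) \\
&\qquad = \sum_i \lambda_A^i(x_A) \lambda_B^i(x_B)
   - |\mathfrak{X}_a|^{-1} \sum_{y_a} \sum_i \lambda_A^i(x_{A \setminus a}, y_a) \lambda_B^i(x_B) \\
&\qquad = \lambda(x_A, x_B) - |\mathfrak{X}_a|^{-1} \sum_{y_a} \lambda(x_{A \setminus a}, y_a, x_B) \\
&\qquad = \lambda(x_A, x_B),
\end{aligned}
$$

where the last step uses the degeneracy of $\lambda$ in $a$. Thus we can
replace each $\lambda_A^i$ with the function

$$
\tilde{\lambda}_A^i(x_A) = \lambda_A^i(x_A) - |\mathfrak{X}_a|^{-1} \sum_{y_a} \lambda_A^i(x_{A \setminus a}, y_a),
$$

which is degenerate in $a$, and not affect the result. By repeating the
argument we can assume that each $\lambda_A^i$ is degenerate in every
$a \in A$, and each $\lambda_B^i$ degenerate in every $b \in B$. □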
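
The centring step in this proof can be tried out on a small case with $A = \{a\}$ and $B = \{b\}$, so that $\lambda$ is a matrix whose rows and columns all sum to zero. The sketch below starts from the trivial rank-one decomposition (row indicators times rows), which is an assumption made for simplicity rather than the decomposition used in the paper, and then subtracts the mean over $x_a$ from each left factor. The helper names are hypothetical.

```python
def center(vec):
    """Subtract the mean so that the entries sum to zero (degenerate)."""
    mean = sum(vec) / len(vec)
    return [v - mean for v in vec]


def degenerate_decomposition(lam):
    """Write a doubly-degenerate matrix lam as a sum of products f_a * f_b.

    Starts from lam[x][y] = sum_i 1{x == i} * lam[i][y] and centres each
    indicator over x; the rows lam[i] are already degenerate in y.
    """
    n_a = len(lam)
    terms = []
    for i in range(n_a):
        indicator = [1.0 if x == i else 0.0 for x in range(n_a)]
        terms.append((center(indicator), list(lam[i])))
    return terms


def reconstruct(terms):
    """Evaluate sum_i f_a^i(x) * f_b^i(y) for every cell (x, y)."""
    n_a = len(terms[0][0])
    n_b = len(terms[0][1])
    return [
        [sum(fa[x] * fb[y] for fa, fb in terms) for y in range(n_b)]
        for x in range(n_a)
    ]
```

Centring leaves the sum unchanged precisely because the columns of `lam` sum to zero, which mirrors the cancellation in the displayed calculation.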

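Lemma A.2. Let $\lambda$ be a discrete $(A \triangle B)$-degenerate function.
Then $\lambda$ can be written as a finite sum

$$
\lambda = \sum_j \lambda_A^j \lambda_B^j
$$

of $A$-degenerate functions $\lambda_A^j$ and $B$-degenerate functions
$\lambda_B^j$.

Proof. Let $A' = A \setminus B$, $B' = B \setminus A$ and $D = A \cap B$, so
that $A \triangle B = A' \cup B'$, $A = A' \cup D$ and $B = B' \cup D$; note
that $A'$, $B'$ and $D$ are all disjoint.

For each $y_D \in \mathfrak{X}_D$, define a degenerate function
$\eta_D(\cdot\,; y_D) : \mathfrak{X}_D \to \mathbb{R}$ by

$$
\eta_D(x_D; y_D) = \alpha^{-1} \prod_{d \in D} \left( |\mathfrak{X}_d| \, \mathbb{1}_{\{x_d = y_d\}} - 1 \right),
$$

where $\alpha = \prod_{d \in D} |\mathfrak{X}_d|^{1/2} (|\mathfrak{X}_d| - 1)^{1/2}$.
One can verify easily that

$$
\sum_{x_d \in \mathfrak{X}_d} \eta_D(x_D; y_D) = 0
$$

for any $d \in D$, $y_D$ and $x_{D \setminus d}$, and that

$$
\sum_{y_D \in \mathfrak{X}_D} \eta_D(x_D; y_D)^2 = 1;
$$

in particular the last expression is independent of $x_D$.

Now, let $\lambda$ be a discrete $(A \triangle B)$-degenerate function, and
using Lemma A.1 write it as

$$
\lambda = \sum_{i=1}^{n} \lambda_{A'}^i \lambda_{B'}^i,
$$

where $\lambda_{A'}^i$ and $\lambda_{B'}^i$ are respectively $A'$- and
$B'$-degenerate. Then for each $k \in \mathfrak{X}_D$, define
$\lambda_A^{i,k} = \lambda_{A'}^i \, \eta_D(\cdot\,; k)$ and
$\lambda_B^{i,k} = \lambda_{B'}^i \, \eta_D(\cdot\,; k)$. Clearly each of these
is degenerate in $A = A' \cup D$ and $B = B' \cup D$ respectively. Further,

$$
\begin{aligned}
\sum_{i=1}^{n} \sum_{k \in \mathfrak{X}_D} \lambda_A^{i,k} \lambda_B^{i,k}
  &= \sum_{i=1}^{n} \sum_{k \in \mathfrak{X}_D} \lambda_{A'}^i \lambda_{B'}^i \, \eta_D(\cdot\,; k)^2 \\
  &= \sum_{i=1}^{n} \lambda_{A'}^i \lambda_{B'}^i \sum_{k \in \mathfrak{X}_D} \eta_D(\cdot\,; k)^2 \\
  &= \sum_{i=1}^{n} \lambda_{A'}^i \lambda_{B'}^i \\
  &= \lambda.
\end{aligned}
$$

□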
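
The two properties of $\eta_D$ used in this proof can be checked numerically. The sketch below assumes the product form $\eta_D(x_D; y_D) = \alpha^{-1} \prod_{d \in D} (|\mathfrak{X}_d| \mathbb{1}\{x_d = y_d\} - 1)$ with $\alpha$ chosen as $\prod_d \sqrt{|\mathfrak{X}_d|(|\mathfrak{X}_d| - 1)}$ so that the squares sum to one; the function names are hypothetical and every variable is assumed to have at least two states.

```python
from itertools import product
from math import prod, sqrt


def eta(x, y, sizes):
    """eta_D(x_D; y_D) for state tuples x and y; sizes[d] = |X_d| >= 2."""
    alpha = prod(sqrt(k * (k - 1)) for k in sizes)
    value = prod(k * (xd == yd) - 1 for xd, yd, k in zip(x, y, sizes))
    return value / alpha


def max_degeneracy_violation(sizes):
    """Largest |sum over x_d of eta| over all d, y_D and remaining x."""
    states = list(product(*(range(k) for k in sizes)))
    worst = 0.0
    for y in states:
        for d, k in enumerate(sizes):
            for x in states:
                if x[d] != 0:
                    continue  # visit each setting of the other coordinates once
                total = sum(
                    eta(x[:d] + (v,) + x[d + 1:], y, sizes) for v in range(k)
                )
                worst = max(worst, abs(total))
    return worst


def max_norm_violation(sizes):
    """Largest |sum over y_D of eta^2 - 1| over all x_D."""
    states = list(product(*(range(k) for k in sizes)))
    return max(
        abs(sum(eta(x, y, sizes) ** 2 for y in states) - 1.0) for x in states
    )
```

Both functions should return values at the level of floating-point rounding for any choice of state-space sizes.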


Lemma A.3. Let $\lambda : \mathfrak{X}_A \to \mathbb{R}$ be an $A$-degenerate
function, and let $A = \triangle_{i \in I} A_i$ for some finite collection of
sets $\{A_i : i \in I\}$. Then there exists a finite collection of
$A_i$-degenerate functions $\lambda_i^j : \mathfrak{X}_{A_i} \to \mathbb{R}$
for $i \in I$, $j \in J$, such that

$$
\lambda = \sum_{j \in J} \prod_{i \in I} \lambda_i^j.
$$

Proof. This follows from repeatedly applying Lemma A.2. □

A.3 Proof of Lemma 6.6


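Lemma A.4. Let $\mathcal{X}$ and $\mathcal{Y}$ be finite sets, define
$\mathcal{F} = \{f : \mathcal{X} \to \mathcal{Y}\}$, and take
$\lambda : \mathcal{Y} \to \mathbb{R}$. Then for any
$A \subseteq \mathcal{Y}$ and $x \in \mathcal{X}$,

$$
\sum_{\substack{f \in \mathcal{F} \\ f(x) \in A}} \lambda(f(x))
  = |\mathcal{Y}|^{|\mathcal{X}| - 1} \sum_{y \in A} \lambda(y),
$$

and if $x_1 \neq x_2$,

$$
\sum_{\substack{f \in \mathcal{F} \\ f(x_1) \in A}} \lambda(f(x_2))
  = |A| \, |\mathcal{Y}|^{|\mathcal{X}| - 2} \sum_{y \in \mathcal{Y}} \lambda(y).
$$

In particular note that if $\lambda$ is degenerate, the last expression is zero.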

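Proof. Clearly if $A = \mathcal{Y}$, then

$$
\begin{aligned}
\sum_{\substack{f \in \mathcal{F} \\ f(x) \in \mathcal{Y}}} \lambda(f(x))
  &= \sum_{f \in \mathcal{F}} \lambda(f(x)) \\
  &= |\mathcal{Y}|^{|\mathcal{X}| - 1} \sum_{y \in \mathcal{Y}} \lambda(y),
\end{aligned}
$$

since there are exactly $|\mathcal{Y}|^{|\mathcal{X}| - 1}$ functions in
$\mathcal{F}$ such that $f(x) = y$ for each $y \in \mathcal{Y}$. The first
result follows in general by setting
$\lambda'(y) = \lambda(y) \mathbb{1}_A(y)$. The second result follows by
similar combinatorial methods. □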
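
Both identities in Lemma A.4 are finite counting statements, so they can be confirmed by brute-force enumeration on small sets. In the sketch below a function $f$ is represented as a tuple of values, the sets are taken to be `range(...)` of the given sizes, and `lam` is a list of the values of $\lambda$; all of these representation choices and function names are assumptions for illustration.

```python
from itertools import product


def all_functions(domain_size, codomain_size):
    """Every function {0..domain_size-1} -> {0..codomain_size-1} as a tuple."""
    return product(range(codomain_size), repeat=domain_size)


def lhs_same_point(lam, domain_size, subset, x):
    """Sum of lam(f(x)) over all f with f(x) in subset."""
    return sum(
        lam[f[x]] for f in all_functions(domain_size, len(lam)) if f[x] in subset
    )


def rhs_same_point(lam, domain_size, subset):
    """|Y|^(|X|-1) times the sum of lam over subset."""
    return len(lam) ** (domain_size - 1) * sum(lam[y] for y in subset)


def lhs_two_points(lam, domain_size, subset, x1, x2):
    """Sum of lam(f(x2)) over all f with f(x1) in subset, for x1 != x2."""
    return sum(
        lam[f[x2]] for f in all_functions(domain_size, len(lam)) if f[x1] in subset
    )


def rhs_two_points(lam, domain_size, subset):
    """|A| * |Y|^(|X|-2) times the sum of lam over all of Y."""
    return len(subset) * len(lam) ** (domain_size - 2) * sum(lam)
```

With `lam = [3, -1, -2]`, which sums to zero and so plays the role of a degenerate function, the two-point sum vanishes for every choice of subset, as the lemma states.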

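Proof of Lemma 6.6. It is clear that we only need prove the result for
$E = \emptyset$, since we can just incorporate $f_E$ as though they were
observable parents of $C$, and the result is the same.

First consider the case $C = \{v\}$; let $L = \mathrm{pa}_{\mathcal{G}}(v)$
and take any set $K \subseteq L$. Let
$\lambda : \mathfrak{X}_v \times \mathfrak{X}_K \to \mathbb{R}$ be a degenerate
function, and for each
$f : \mathfrak{X}_L \times \mathcal{F}_{\pi(v)} \to \mathfrak{X}_v$, define

$$
\delta(f) = \sum_{y_L \in \mathfrak{X}_L} \sum_{g_{\pi(v)} \in \mathcal{F}_{\pi(v)}}
  \lambda\big(f(y_L, g_{\pi(v)}), y_K\big).
$$

Then for fixed $x_v$, $x_L$, $f_{\pi(v)}$,

$$
\begin{aligned}
\sum_{\substack{f \in \mathcal{F}_v \\ f(x_L, f_{\pi(v)}) = x_v}} \delta(f)
  &= \sum_{\substack{f \in \mathcal{F}_v \\ f(x_L, f_{\pi(v)}) = x_v}}
     \sum_{y_L \in \mathfrak{X}_L} \sum_{g_{\pi(v)} \in \mathcal{F}_{\pi(v)}}
     \lambda\big(f(y_L, g_{\pi(v)}), y_K\big) \\
  &= \sum_{y_L \in \mathfrak{X}_L} \sum_{g_{\pi(v)} \in \mathcal{F}_{\pi(v)}}
     \sum_{\substack{f \in \mathcal{F}_v \\ f(x_L, f_{\pi(v)}) = x_v}}
     \lambda\big(f(y_L, g_{\pi(v)}), y_K\big).
\end{aligned}
$$

But since $\lambda$ is degenerate, the inner sum is zero unless both
$x_L = y_L$ and $f_{\pi(v)} = g_{\pi(v)}$, by Lemma A.4. This leaves

$$
\begin{aligned}
\sum_{\substack{f \in \mathcal{F}_v \\ f(x_L, f_{\pi(v)}) = x_v}} \delta(f)
  &= \sum_{\substack{f \in \mathcal{F}_v \\ f(x_L, f_{\pi(v)}) = x_v}}
     \lambda\big(f(x_L, f_{\pi(v)}), x_K\big) \\
  &= |\mathfrak{X}_v|^{|\mathfrak{X}_L| |\mathcal{F}_{\pi(v)}| - 1} \cdot \lambda(x_v, x_K),
\end{aligned}
$$

where the constant represents the number of distinct functions
$f \in \mathcal{F}_v$ such that $f(x_L, f_{\pi(v)}) = x_v$. Hence the result
holds for $C = \{v\}$.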

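Now consider a general $C$; we prove the result by induction on the size of
$C$. Given any
$\mathrm{sterile}_{\mathcal{G}}(C) \subseteq A \subseteq C \cup \mathrm{pa}_{\mathcal{G}}(C)$,
we first claim that we can write $A = A_1 \triangle A_2$, where
$\mathrm{sterile}_{\mathcal{G}}(C_i) \subseteq A_i \subseteq C_i \cup \mathrm{pa}_{\mathcal{G}}(C_i)$
for $i = 1, 2$ and disjoint non-empty $C_1$, $C_2$ with $C_1 \cup C_2 = C$.

To see this pick $C_2 = \{w\}$, $C_1 = C \setminus \{w\}$ for some
$w \in \mathrm{sterile}_{\mathcal{G}}(C)$, and then set
$A_1 = (A \cup \mathrm{sterile}_{\mathcal{G}}(C_1)) \cap (C_1 \cup \mathrm{pa}_{\mathcal{G}}(C_1))$
and $A_2 = A \triangle A_1$. Clearly $A_1$ satisfies the required conditions.
Since $w$ was chosen to be sterile, $w \notin A_1$ and therefore $w \in A_2$;
in addition, the only elements of $A$ not contained in $A_1$ are those which
are neither in $C_1$ nor $\mathrm{pa}_{\mathcal{G}}(C_1)$; but since they are
in $C \cup \mathrm{pa}_{\mathcal{G}}(C)$, they must instead be in
$\{w\} \cup \mathrm{pa}_{\mathcal{G}}(w)$. Hence the claim holds.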

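Now first suppose that $\lambda = \lambda_1 \cdot \lambda_2$ for degenerate
functions $\lambda_i : \mathfrak{X}_{A_i} \to \mathbb{R}$. By the induction
hypothesis, we can find degenerate $\delta_1$, $\delta_2$ such that

$$
\begin{aligned}
\sum_{\substack{f_v(x, f) = x_v \\ v \in C_1}} \delta_1(f_{C_1}) &= c_1 \cdot \lambda_1(x_{A_1}), \\
\sum_{\substack{f_v(x, f) = x_v \\ v \in C_2}} \delta_2(f_{C_2}) &= c_2 \cdot \lambda_2(x_{A_2}).
\end{aligned}
$$

Then letting $E = R_i \setminus C$,

$$
\begin{aligned}
\sum_{\substack{f_v(x, f) = x_v \\ v \in R_i}} \delta_1(f_{C_1}) \, \delta_2(f_{C_2})
  &= \sum_{\substack{f_v(x, f) = x_v \\ v \in E}}
     \sum_{\substack{f_v(x, f) = x_v \\ v \in C_1}}
     \sum_{\substack{f_v(x, f) = x_v \\ v \in C_2}}
     \delta_1(f_{C_1}) \, \delta_2(f_{C_2}) \\
  &= c_0 \sum_{\substack{f_v(x, f) = x_v \\ v \in C_1}}
         \sum_{\substack{f_v(x, f) = x_v \\ v \in C_2}}
     \delta_1(f_{C_1}) \, \delta_2(f_{C_2}) \\
  &= c_0 \sum_{\substack{f_{C_1} \in \mathcal{F}_{C_1} \\ f_{C_1}(x) = x_{C_1}}} \delta_1(f_{C_1})
         \sum_{\substack{f_w \in \mathcal{F}_w \\ f_w(x) = x_w}} \delta_2(f_w) \\
  &= c_0 c_1 c_2 \cdot \lambda_1(x_{A_1}) \cdot \lambda_2(x_{A_2}).
\end{aligned}
$$

However, by Lemma A.2 a general degenerate function
$\lambda : \mathfrak{X}_A \to \mathbb{R}$ can be written as a finite linear
combination

$$
\lambda = \sum_j \lambda_1^j \cdot \lambda_2^j
$$

of degenerate functions $\lambda_i^j : \mathfrak{X}_{A_i} \to \mathbb{R}$, so
the result follows by linearity of summations.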

For the final part, note that if $v \in C \setminus B$, then the summation
over $f_v$ will include every function $f_v \in \mathcal{F}_v$. Then $\delta$
is degenerate and a function of $f_v$, so the sum is 0. On the other hand, if
$v \in (R_i \cap B) \setminus C$, then $\delta$ is not a function of $f_v$ and
summing over all $\mathcal{F}_v$ just involves $|\mathfrak{X}_v|$ extra
identical terms in the summation. □

A.4 Proof of Lemma 6.8

Proof of Lemma 6.8. First construct a directed graph $\Pi^*_C$ on $I_C$ in
which $i \to j$ precisely when there exist $v_j \in R_j \cap C$ and
$v_i \in R_i \cap C$ such that $v_j \in \pi(v_i)$. That $\Pi^*_C$ is acyclic
follows from the definition of $\pi$, which implicitly imposes a partial order
on the $R_j$.

Let $j$ be the maximal element of $I_C$; we claim that for any other
$i \in I_C$, there is always a directed path in $\Pi^*_C$ from $j$ to $i$. To
see this, note that since $C$ is bidirected-connected, there is a bidirected
path in $\mathcal{G}$ from some $v_j \in C \cap R_j$ to $v_i \in C \cap R_i$;
given such a path, $p$, trim it so that only the end-points are in
$C \cap R_j$ and $C \cap R_i$ respectively.

If $p$ is just $v_j \leftrightarrow v_i$, then we are done, since
$v_j \in \pi(v_i)$ by definition of $\pi$. Otherwise, $p$ begins
$v_i \leftrightarrow v_k \leftrightarrow \cdots$ for some
$v_k \in R_k \cap C$, where $i > k > j$. So we can apply an inductive argument
to find a path from $j$ to $k$ in $\Pi^*_C$, and the edge
$v_i \leftrightarrow v_k$ implies that $k \to i$ in $\Pi^*_C$.

Now, $\Pi^*_C$ is a connected DAG with a unique root node $j$, so we can
simply take any singly connected subgraph $\Pi_C$ to fulfil the conditions of
the lemma.

□ 
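
One simple way to pick a singly connected subgraph of a DAG in which every node is reachable from a unique root is a breadth-first search from the root, keeping only the edge by which each node is first reached. The sketch below illustrates that generic construction; the edge-list representation and the function name are assumptions for this example, and the proof does not depend on which subgraph is chosen.

```python
from collections import deque


def spanning_tree_from_root(edges, root):
    """Return tree edges of a BFS from root in a directed graph.

    edges: iterable of (parent, child) pairs.
    Every node reachable from root receives exactly one incoming tree
    edge, so the result is singly connected with root as its root.
    """
    children = {}
    for parent, child in edges:
        children.setdefault(parent, []).append(child)

    tree = []
    seen = {root}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for child in children.get(node, []):
            if child not in seen:  # keep only the first edge into each node
                seen.add(child)
                tree.append((node, child))
                queue.append(child)
    return tree
```

For the edges `[(1, 2), (1, 3), (2, 3), (3, 4)]` with root 1, the redundant edge `(2, 3)` is dropped and the tree edges `(1, 2)`, `(1, 3)`, `(3, 4)` remain.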